deep Generative Stochastic Networks Trainable by Backprop (2015-12-16T20:02:08Z)<p>Arashwan: </p>
<hr />
<div>= Introduction =<br />
<br />
[[File:figure_1_bengio.png |thumb|upright=1.75| Figure 1 Top: <ref>Bengio, Yoshua, Mesnil, Grégoire, Dauphin, Yann, and<br />
Rifai, Salah. Better mixing via deep representations. In<br />
ICML’13, 2013b. </ref> A denoising auto-encoder defines an estimated Markov chain where the transition operator first samples a corrupted <math>\bar{X}</math> from <math>C(\bar{X}|X)</math> and then samples a reconstruction<br />
from <math>P_o(X|\bar{X})</math>, which is trained to estimate the ground truth <math>P(X|\bar{X})</math>.<br />
Note how, for any given <math>\bar{X}</math>, <math>P(X|\bar{X})</math> is a much<br />
simpler (roughly unimodal) distribution than the ground truth<br />
<math>P(X)</math> and its partition function is thus easier to approximate.<br />
Bottom: More generally, a GSN allows the use of arbitrary latent<br />
variables H in addition to X, with the Markov chain state (and<br />
mixing) involving both X and H. Here H is the angle about<br />
the origin. The GSN inherits the benefit of a simpler conditional<br />
and adds latent variables, which allow far more powerful deep<br />
representations in which mixing is easier]]<br />
<br />
The Deep Learning boom of recent years was initially spurred by research in unsupervised learning techniques.<ref><br />
Bengio, Yoshua. Learning deep architectures for AI. Now<br />
Publishers, 2009.</ref> However, most of the major successes over the last few years have been based on supervised techniques. A drawback of the unsupervised methods is that their models involve intractable sums (in inference, learning, sampling and partition functions) that are too expensive to compute. The paper puts forth a network that models a conditional distribution, <math>P(X|\bar{X})</math>, which can be seen as a local, usually unimodal, representation of <math>P(X)</math>, where <math>\bar{X}</math> is a corrupted version of the original data <math>X</math>. The Generative Stochastic Network (GSN) adds arbitrary latent variables <math>H</math> to the state of a Markov chain, organized in layers, whose samples eventually represent the original data. Training the network does not need Gibbs sampling or large partition functions; it uses backpropagation and all the tools that come with it. <br />
<br />
In a DBM <ref> Salakhutdinov, Ruslan and Hinton, Geoffrey E. Deep<br />
Boltzmann machines. In AISTATS’2009, pp. 448–455,<br />
2009 </ref>, sampling from <math>P(x, h)</math> relies on approximate inference and sampling (the contrastive divergence algorithm). Obtaining a gradient requires intractable sums to be calculated, although there are ways around this. The problem with these workarounds is that they make strong assumptions: in essence, the sampling methods used for these calculations are biased towards certain types of distribution (e.g. those with a small number of modes). The GSN is an attempt to get around this. <br />
<br />
The reasoning for wanting to have a tractable generative model that uses unsupervised training is that within the realm of data, there is a far greater amount of unlabelled data than labelled data. Future models should be able to take advantage of this information.<br />
<br />
= Generative Stochastic Network (GSN) = <br />
[[File:figure_2_bengio.png |thumb|left|upright=2| Figure 2 Left: Generic GSN Markov chain with state variables <math>X_t</math> and <math>H_t</math>. Right: GSN Markov chain inspired by the unfolded<br />
computational graph of the Deep Boltzmann Machine Gibbs sampling process, but with backprop-able stochastic units at each layer.<br />
The training example <math>X = x_0</math> starts the chain. Either odd or even layers are stochastically updated at each step. All <math>x_t</math>’s are corrupted by<br />
salt-and-pepper noise before entering the graph (lightning symbol). Each <math>x_t</math> for <math>t > 0</math> is obtained by sampling from the reconstruction<br />
distribution for that step, <math>P_{\theta_2}(X_t|H_t)</math>. The walkback training objective is the sum over all steps of log-likelihoods of target <math>X = x_0</math><br />
under the reconstruction distribution. In the special case of a unimodal Gaussian reconstruction distribution, maximizing the likelihood<br />
is equivalent to minimizing reconstruction error; in general one trains to maximum likelihood, not simply minimum reconstruction error]]<br />
<br />
The paper describes the Generative Stochastic Network as a generalization of generative denoising autoencoders, since its estimates of the data are based on samples corrupted by noise. Rather than directly estimating the data distribution <math>P(X)</math>, the model parametrizes the transition operator of a Markov chain, that is <math>P(x_t | x_{t-1})</math> or <math>P(x_t, h_t|x_{t-1}, h_{t-1})</math>. This change turns the problem into one that is much closer to supervised training. The transition distribution contains only a small number of important modes, which leads to a simple gradient of the partition function and lets the model leverage the strength of function approximation. As a result, an unsupervised model can be trained by gradient descent and maximum likelihood with no partition functions, just back-propagation.<br />
<br />
The estimation of <math>P(X)</math> proceeds as follows. First, create <math>\bar{X}</math> by sampling from the corruption distribution <math>C(\bar{X}|X)</math>, which adds some type of noise to the original data. The model is then trained to reconstruct <math>X</math> from <math>\bar{X}</math> and thus obtain <math>P(X|\bar{X})</math>. This is easier to model than the whole of <math>P(X)</math>, since <math>P(X|\bar{X})</math> is dominated by fewer modes than <math>P(X)</math>. Bayes' rule then dictates that <math>P(X|\bar{X}) = \frac{1}{z}C(\bar{X}|X)P(X)</math>, where z is a normalizing constant that does not depend on <math>X</math>. This makes it possible to recover <math>P(X)</math> from the other two distributions. <br />
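<br />
The corrupt-then-reconstruct chain can be checked on a toy problem. The sketch below is only an illustration under stated assumptions: a three-state variable, a made-up data distribution <code>P</code> and a made-up corruption process with keep probability <code>KEEP</code>, none of which come from the paper. It uses the exact Bayes posterior in place of a trained network and shows that repeatedly corrupting and reconstructing drives the chain towards <math>P(X)</math>.<br />

```python
# Toy check of the corrupt-then-reconstruct chain on a 3-state variable.
P = [0.5, 0.3, 0.2]   # assumed data distribution P(X), illustration only
KEEP = 0.7            # assumed probability that corruption leaves X unchanged
n = len(P)

# Corruption C(X_bar | X): keep X with probability KEEP,
# otherwise move to one of the other states uniformly at random.
C = [[KEEP if xb == x else (1 - KEEP) / (n - 1) for xb in range(n)]
     for x in range(n)]

# Normalizing constant z(X_bar) = sum_X C(X_bar | X) P(X); it does not depend on X.
z = [sum(C[x][xb] * P[x] for x in range(n)) for xb in range(n)]

# Bayes' rule: posterior[xb][x] = P(X = x | X_bar = xb).
posterior = [[C[x][xb] * P[x] / z[xb] for x in range(n)] for xb in range(n)]

# One chain step: corrupt X, then sample a reconstruction from the posterior.
T = [[sum(C[x][xb] * posterior[xb][x2] for xb in range(n)) for x2 in range(n)]
     for x in range(n)]

# Start from a point mass on state 0 and iterate the transition operator.
pi = [1.0, 0.0, 0.0]
for _ in range(200):
    pi = [sum(pi[x] * T[x][x2] for x in range(n)) for x2 in range(n)]

print(pi)  # compare with P
```

In a GSN the exact posterior is replaced by the learned <math>P_{\theta}(X|\bar{X})</math>, so the quality of the chain depends on how well that conditional is estimated.<br />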
<br />
Using a parametrized model (e.g. a neural network), it was found that the approximation made by the model, <math>P_{\theta}(X|\bar{X})</math>, could be used to approximate <math>P_{\theta}(X)</math>. The distribution <math>\pi(X)</math> of the resulting Markov chain will eventually converge to <math>P(X)</math>. Figure 2 shows this process. <br />
<br />
One may wonder where the complexity of the original data distribution went. If <math>P_{\theta}(X|\bar{X})</math> and <math>C(\bar{X}|X)</math> are not complex, then how can they model the complex distribution <math>P(X)</math>? The authors explain that even though <math>P_{\theta}(X|\bar{X})</math> has few modes, the location of the modes depends on <math>\bar{X}</math>. Since the estimate is based on many values of <math>\bar{X}</math>, the model learns a mapping from <math>\bar{X}</math> to a mode location, which turns the problem into a supervised function approximation problem (which is comparatively easy).<br />
<br />
Training the GSN involves moving along a Markov chain and using the transition distribution between its states to update the weights of the GSN. The transition distribution <math>f(h,h', x)</math> is trained to maximize reconstruction likelihood. The following picture shows the Markov chain used to train the model. Note the similarities to Hinton's contrastive divergence.<br />
<br />
[[File:bengio_markov.png |centre|]]<br />
<br />
<br />
<br />
= Experimental Results =<br />
Some initial experimental results were obtained without extensive parameter tuning. This was done to maintain consistency across the tests, and likely to show that even without optimization the results approach the performance of more established unsupervised learning networks. The main comparison was made to Deep Boltzmann Machines (DBM) and Deep Belief Networks (DBN). <br />
<br />
=== MNIST ===<br />
<br />
The non-linearity for the units in the GSN was applied as <math display="block"> h_i = \eta_{out} + \tanh (\eta_{in} + a_i), </math> where <math>a_i</math> is the linear activation for unit <math>i</math>, and <math>\eta_{in}</math> and <math>\eta_{out}</math> are both zero-mean Gaussian noise. Sampling from partial or incomplete data can be done in a similar manner to a DBM, where representations can propagate upwards and downwards in the network. This allows for pattern completion similar to that achieved by a DBM. The third image in Figure 3 demonstrates the GSN's ability to start from only half an image (where the rest is noise) and complete the digit, showing that it has an internal representation of the digit that can be sampled to complete it. <br />
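<br />
As a minimal sketch of the noisy unit described above: the function name and the noise standard deviations <code>sigma_in</code> and <code>sigma_out</code> are assumptions for illustration, since the summary only states that both noise terms are zero-mean Gaussian.<br />

```python
import math
import random


def noisy_tanh(a, sigma_in=1.0, sigma_out=1.0, rng=random):
    """Return eta_out + tanh(eta_in + a) with zero-mean Gaussian noise terms.

    sigma_in and sigma_out are assumed standard deviations (hypothetical values).
    """
    eta_in = rng.gauss(0.0, sigma_in)    # noise injected before the non-linearity
    eta_out = rng.gauss(0.0, sigma_out)  # noise injected after the non-linearity
    return eta_out + math.tanh(eta_in + a)


# Usage: a few stochastic activations for the same linear activation a_i = 0.5.
rng = random.Random(0)
samples = [noisy_tanh(0.5, rng=rng) for _ in range(5)]
print(samples)
```

Setting both standard deviations to zero recovers the ordinary deterministic <math>\tanh</math> unit.<br />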
<br />
<br />
[[File:figure_3_bengio.png |thumb|centre|upright=2| Figure 3 Top: two runs of consecutive samples (one row after the<br />
other) generated from 2-layer GSN model, showing fast mixing<br />
between classes, nice and sharp images. Note: only every fourth<br />
sample is shown; see the supplemental material for the samples<br />
in between. Bottom: conditional Markov chain, with the right<br />
half of the image clamped to one of the MNIST digit images and<br />
the left half successively resampled, illustrating the power of the<br />
generative model to stochastically fill-in missing inputs.]]<br />
<br />
=== Faces ===<br />
<br />
The following figure shows the GSN's ability to perform facial reconstruction. <br />
<br />
[[File:figure_4_bengio.png |thumb | upright=2|centre | Figure 4 GSN samples from a 3-layer model trained on the TFD<br />
dataset. Every second sample is shown; see supplemental material<br />
for every sample. At the end of each row, we show the nearest<br />
example from the training set to the last sample on that row, to illustrate<br />
that the distribution is not merely copying the training set.]]<br />
<br />
<br />
=== Comparison ===<br />
The comparison below reports the test set log-likelihood lower bound (LL) obtained by<br />
a Parzen density estimator constructed using 10,000 generated<br />
samples, for different generative models trained on MNIST.<br />
The LL is not directly comparable to AIS likelihood estimates<br />
because a Gaussian mixture rather than a Bernoulli<br />
mixture is used to compute the likelihood. A DBN-2 has 2 hidden layers, a CAE-1 (Contractive Autoencoder)<br />
has 1 hidden layer, and a CAE-2 has 2. The DAE (Denoising Autoencoder) is basically a<br />
GSN-1, with no injection of noise inside the network.<br />
<br />
[[File:GSN_comparison.png]]<br />
<ref>Rifai, Salah, Bengio, Yoshua, Dauphin, Yann, and Vincent,<br />
Pascal. A generative process for sampling contractive<br />
auto-encoders. In ICML’12, 2012</ref><br />
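<br />
The Parzen estimate used for this comparison can be sketched as follows. This is a generic Gaussian Parzen window estimator, not the authors' evaluation code; the function name is hypothetical and the bandwidth <code>sigma</code> is a free parameter that the summary does not specify.<br />

```python
import math


def parzen_log_likelihood(x, samples, sigma):
    """Log-density of x under an equal-weight mixture of isotropic Gaussians
    (standard deviation sigma) centred on the generated samples."""
    d = len(x)
    # Log of the Gaussian normalizer and of the 1/N mixture weight.
    log_norm = -0.5 * d * math.log(2.0 * math.pi * sigma ** 2) - math.log(len(samples))
    exponents = [
        -sum((xi - si) ** 2 for xi, si in zip(x, s)) / (2.0 * sigma ** 2)
        for s in samples
    ]
    # Log-sum-exp for numerical stability.
    m = max(exponents)
    return log_norm + m + math.log(sum(math.exp(e - m) for e in exponents))


def mean_log_likelihood(test_points, samples, sigma):
    """Average Parzen log-likelihood over a test set."""
    return sum(parzen_log_likelihood(x, samples, sigma) for x in test_points) / len(test_points)


# Usage with made-up 2-D points standing in for generated and test images.
generated = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
test_set = [[0.1, 0.1], [0.9, 0.8]]
print(mean_log_likelihood(test_set, generated, sigma=0.5))
```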
<br />
= Conclusions and Critique =<br />
The main objective of the paper and technique was to avoid the intractable aspects of traditional generative models. This was achieved by training a model to reconstruct noisy data, which created a local and simple approximation of the whole data distribution. This was done over and over, treated as a Markov chain, with each transition distribution corresponding to a new representation of the data distribution. The model can be trained with supervised neural network tools. Experiments show similarity between results from the GSN and the DBM. However, there is no need for layer-wise pre-training of the GSN. <br />
<br />
One critique of this paper is that the authors continually point out that their method should, in theory, be faster than the traditional models. They show that a similar model can achieve similar results, but they do not provide any information on the time each network took to train. This could be addressed by training networks with approximately the same number of parameters on a specific task, and timing and evaluating them on that basis. <br />
The paper does not do a very good job of describing how the training is done in relation to the Markov chain. The relationship can be teased out eventually, though it is not immediately apparent and could have been elaborated upon further. <br />
There is one section that briefly glosses over Sum Product Networks (SPN) as an alternative tractable graphical model. Since SPNs solve the same problem that the authors propose to solve, it would have made sense to evaluate their model against SPNs as well; however, they did not do this.<br />
<br />
= References =<br />
<references /></div>
deep Sparse Rectifier Neural Networks (2015-12-16T19:58:04Z)<p>Arashwan: /* Introduction */</p>
<hr />
<div>= Introduction =<br />
<br />
Machine learning scientists and computational neuroscientists deal with neural networks differently. Machine learning scientists aim to obtain models that are easy to train and that generalize well, while neuroscientists' objective is to produce useful representations of the scientific data. In other words, machine learning scientists care more about efficiency, while neuroscientists care more about interpretability of the model.<br />
<br />
In this paper, the authors show that two common gaps between computational neuroscience models and machine learning neural network models can be bridged by the rectifier activation function. One is between deep networks learnt with and without unsupervised pre-training; the other is between the activation function and sparsity in neural networks.<br />
<br />
== Biological Plausibility and Sparsity ==<br />
<br />
In the brain, neurons rarely fire at the same time, as a way to balance quality of representation and energy conservation. This is in stark contrast to sigmoid neurons, which fire at 1/2 of their maximum rate when their input is zero. A solution to this problem is to use a rectifier neuron, which does not fire when its input is zero. This rectified linear unit is inspired by a common biological model of the neuron, the leaky integrate-and-fire model (LIF), described by Dayan and Abbott<ref><br />
Dayan, Peter and Abbott, L. F. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems<br />
</ref>. Its function is illustrated in the figure below (middle).<br />
<br />
<gallery><br />
Image:sig_neuron.png|Sigmoid and TANH Neuron<br />
Image:lif_neuron.png|Leaky Integrate Fire Neuron<br />
Image:rect_neuron.png|Rectified Linear Neuron<br />
</gallery><br />
<br />
Given that the rectifier neuron maps a larger range of inputs to zero, its representation will be more sparse. In the paper, the two most salient advantages of sparsity are:<br />
<br />
- '''Information Disentangling''' As opposed to a dense representation, where every slight input change results in a considerable output change, the non-zero items of a sparse representation remain almost constant under slight input changes.<br />
<br />
- '''Variable Dimensionality''' A sparse representation can effectively choose how many dimensions to use to represent a variable, since it chooses how many non-zero elements contribute. Thus, the precision is variable, allowing for more efficient representation of complex items.<br />
<br />
Further benefits of a sparse representation, and of rectified linear neurons in particular, are better linear separability (because the input is represented in a higher-dimensional space) and lower computational complexity (most units are off, and for the on-units only a linear function has to be computed).<br />
<br />
However, it should also be noted that sparsity reduces the capacity of the model because each unit takes part in the representation of fewer values.<br />
<br />
== Advantages of rectified linear units ==<br />
<br />
The rectifier activation function <math>\max(0, x)</math> allows a network to easily obtain sparse representations, since only a subset of hidden units will have a non-zero activation value for a given input; this sparsity can be further increased through regularization methods. Therefore, the rectified linear activation function benefits from the advantages of sparsity listed in the previous section.<br />
<br />
For a given input, only a subset of hidden units in each layer will have non-zero activation values. The rest of the hidden units output zero and are essentially turned off. Because the rectifier is linear on its active side, each hidden unit's activation is a linear combination of the active (non-zero) hidden units in the previous layer. Repeating this through each layer, one can see that the neural network is actually an exponential number of linear models that share parameters, since the later layers use the same values from the earlier layers. Since the network is linear along the active paths, the gradient is easy to compute and travels back through the active nodes without the vanishing gradient problem caused by saturating non-linearities such as sigmoid or tanh. <br />
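<br />
The "linear model per active set" view can be illustrated with a tiny network. The weights below are hypothetical and biases are omitted for brevity; the sketch checks that, for a fixed input, the output of the rectifier network equals a single linear map built only from the active units.<br />

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


# Hypothetical weights for a 2-3-1 rectifier network without biases.
W1 = [[1.0, -1.0], [-2.0, 0.5], [0.5, 1.0]]
W2 = [[1.0, 3.0, -1.0]]


def forward(x):
    """Return the network output and the hidden activations."""
    h = relu(matvec(W1, x))
    return matvec(W2, h)[0], h


def active_linear_model(h):
    """Fold W2 and the rows of W1 belonging to active units into one weight vector."""
    n_inputs = len(W1[0])
    return [
        sum(W2[0][j] * W1[j][i] for j in range(len(h)) if h[j] > 0.0)
        for i in range(n_inputs)
    ]


x = [1.0, 0.5]
y, h = forward(x)
w_eff = active_linear_model(h)
y_linear = sum(w * xi for w, xi in zip(w_eff, x))

print("hidden activations:", h)  # one unit is exactly zero (sparse)
print("network output:", y, "linear model output:", y_linear)
```

A different input that switches a different set of units on would select a different linear model, built from the same shared weights.<br />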
<br />
The sparsity and linear model can be seen in the figure the researchers made:<br />
<br />
[[File:RLU.PNG]]<br />
<br />
Each layer is a linear combination of the previous layer.<br />
<br />
== Potential problems of rectified linear units ==<br />
<br />
The zero derivative below zero in the rectified neurons blocks the back-propagation of the gradient during learning. This effect was investigated using a smooth variant of the rectification non-linearity (the softplus activation). Surprisingly, the results suggest the hard rectification performs better. The authors hypothesize that the hard rectification is not a problem as long as the gradient can be propagated along some paths through the network, and that the complete shut-off with the hard rectification sharpens the credit attribution to neurons in the learning phase.<br />
<br />
Furthermore, the unbounded nature of the rectification non-linearity can lead to numerical instabilities if activations grow too large. To circumvent this, an <math>L_1</math> regularizer is used. Also, if symmetry is required, it can be obtained by using two rectifier units with shared parameters, but this requires twice as many hidden units as a network with a symmetric activation function.<br />
<br />
Finally, rectifier networks are subject to ill conditioning of the parametrization. Biases and weights can be scaled in different (and consistent) ways while preserving the same overall network function.<br />
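<br />
The ill-conditioning mentioned above can be made concrete with a small sketch. All weights are made up for illustration: scaling the first layer's weights and biases by a positive factor <code>alpha</code> and dividing the next layer's weights by the same factor leaves the network function unchanged, because <math>\max(0, \alpha x) = \alpha \max(0, x)</math> for <math>\alpha > 0</math>.<br />

```python
def relu(values):
    return [max(0.0, v) for v in values]


def affine(weights, bias, inputs):
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]


def network(x, W1, b1, W2, b2):
    """One hidden rectifier layer followed by a linear output layer."""
    return affine(W2, b2, relu(affine(W1, b1, x)))


# Hypothetical parameters.
W1 = [[1.0, -1.0], [0.5, 2.0]]
b1 = [0.25, -1.0]
W2 = [[2.0, -3.0]]
b2 = [0.5]

# Rescale the hidden layer by alpha and undo it in the output layer.
alpha = 4.0
W1_scaled = [[alpha * w for w in row] for row in W1]
b1_scaled = [alpha * b for b in b1]
W2_scaled = [[w / alpha for w in row] for row in W2]

x = [1.5, 0.5]
original = network(x, W1, b1, W2, b2)
rescaled = network(x, W1_scaled, b1_scaled, W2_scaled, b2)
print(original, rescaled)
```

Many different parameter settings therefore implement the same function, which is what makes the parametrization ill-conditioned.<br />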
<br />
= Experiments =<br />
<br />
Networks with rectifier neurons were applied to the domains of image recognition and sentiment analysis. The datasets for image recognition included black and white (MNIST, NISTP), colour (CIFAR10) and stereo (NORB) images.<br />
<br />
The datasets for sentiment analysis were taken from opentable.com and Amazon. The task of both was to predict the star rating based off the text blurb of the review.<br />
<br />
== Results ==<br />
<br />
'''Results from image classification'''<br />
[[File:rectifier_res_1.png]]<br />
<br />
'''Results from sentiment classification'''<br />
[[File:rectifier_res_2.png]]<br />
<br />
For the image recognition tasks, they find that there is almost no improvement from unsupervised pre-training when using rectifier activations, contrary to what is observed with tanh or softplus. In fact, the rectifier network achieves its best performance when trained without unsupervised pre-training.<br />
<br />
In the NORB and sentiment analysis cases, the network benefited greatly from pre-training. However, the benefit in NORB diminished as the training set size grew.<br />
<br />
The result from the Amazon dataset was 78.95%, while the state of the art was 73.72%.<br />
<br />
The sparsity achieved with the rectified linear neurons helps to diminish the gap between networks with unsupervised pre-training and no pre-training.<br />
<br />
== Discussion / Criticism ==<br />
<br />
* Rectifier neurons really aren't biologically plausible, for a variety of reasons. Namely, the neurons in the cortex do not have tuning curves resembling the rectifier. Additionally, the ideal sparsity of the rectifier networks was from 50 to 80%, while the brain is estimated to have a sparsity of around 95 to 99%.<br />
<br />
* The sparsity property encouraged by ReLU is a double-edged sword. While sparsity encourages information disentangling, efficient variable-size representation, linear separability and increased robustness, as suggested by the authors of this paper, <ref>Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).</ref> argues that computing on sparse non-uniform data structures is very inefficient: the overhead and cache misses would make sparse data structures too computationally expensive to justify.<br />
<br />
* ReLU does not suffer from the vanishing gradient problem.<br />
<br />
* ReLU units can be prone to "die"; in other words, a unit may output the same value regardless of the input it is given. This occurs when a large negative bias is learnt, causing the output of the ReLU to be zero for every input; the unit then stays stuck at zero because the gradient of the rectifier is zero wherever its output is zero. Techniques that mitigate this problem include Leaky ReLU and Maxout.<br />
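<br />
The "dying" behaviour in the last point can be sketched with a single unit. The weight, bias, inputs and the leaky slope of 0.01 are hypothetical values chosen for illustration; the sketch compares the gradient with respect to the weight for a plain rectifier and for a leaky variant.<br />

```python
def relu_grad(a):
    """Derivative of max(0, a) with respect to the pre-activation a."""
    return 1.0 if a > 0.0 else 0.0


def leaky_relu_grad(a, slope=0.01):
    """Derivative of a leaky rectifier; slope is an assumed small constant."""
    return 1.0 if a > 0.0 else slope


weight, bias = 0.5, -10.0            # hypothetical unit with a large negative bias
inputs = [0.0, 0.25, 0.5, 0.75, 1.0]

pre_activations = [weight * x + bias for x in inputs]

# Gradient of the unit's output with respect to its weight is f'(a) * x.
relu_weight_grads = [relu_grad(a) * x for a, x in zip(pre_activations, inputs)]
leaky_weight_grads = [leaky_relu_grad(a) * x for a, x in zip(pre_activations, inputs)]

print("ReLU gradients: ", relu_weight_grads)
print("Leaky gradients:", leaky_weight_grads)
```

With the plain rectifier every gradient is zero, so gradient descent cannot move the unit out of the dead region, whereas the leaky variant still passes a small signal.<br />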
<br />
= Bibliography =<br />
<references /></div>
on using very large target vocabulary for neural machine translation (2015-12-16T19:08:37Z)<p>Arashwan: /* Overview */</p>
<hr />
<div>==Overview==<br />
<br />
This is a summary of the paper by S. Jean, K. Cho, R. Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation"<br />
<ref>S. Jean, K. Cho, R. Memisevic, and Y. Bengio. [http://arxiv.org/pdf/1412.2007v2.pdf "On Using Very Large Target Vocabulary for Neural Machine Translation"], 2015.</ref><br />
The paper presents the application of importance sampling to neural machine translation with a very large target vocabulary. Despite the advantages of neural networks in translation over statistical machine translation systems, such as phrase-based systems, they suffer from some technical problems. Most importantly, they are limited to a small vocabulary because of the computational complexity and the number of parameters that have to be trained. To explain, the output layer of an RNN used for machine translation has as many units as there are items in the vocabulary. If the vocabulary has hundreds of thousands of terms, then the RNN must compute a very expensive softmax over the output units at each time step when predicting an output sequence. Moreover, the number of parameters in the RNN also grows very large in such cases, given that the number of weights between the hidden layer and the output layer is equal to the product of the number of units in each layer. For a non-trivially sized hidden layer, a large vocabulary could result in tens of millions of model parameters associated with the hidden-to-output mapping alone. In practice, researchers who apply RNNs to machine translation have avoided this problem by restricting the model vocabulary to a shortlist of words in the target language. Words not in this shortlist are treated as unknown by the model and assigned a special 'UNK' token. This technique understandably impairs translation performance when the target sentence includes a large number of words not present in the vocabulary. <br />
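<br />
The parameter count described above is simple arithmetic and can be sketched as follows. The hidden layer size of 1000 is an assumption for illustration only, and the bias term per output unit is added here on top of the weight count mentioned in the text.<br />

```python
def output_layer_parameters(hidden_units, vocabulary_size):
    """Weights plus biases of a dense hidden-to-output (softmax) layer."""
    weights = hidden_units * vocabulary_size  # one weight per (hidden unit, word) pair
    biases = vocabulary_size                  # one bias per word (added assumption)
    return weights + biases


HIDDEN_UNITS = 1000  # assumed hidden layer size, illustration only

for vocabulary_size in (30_000, 500_000):
    count = output_layer_parameters(HIDDEN_UNITS, vocabulary_size)
    print(f"{vocabulary_size:>7} words -> {count:,} output-layer parameters")
```

Under these assumptions a shortlist of 30,000 words already needs about 30 million output parameters, and the softmax cost per time step grows in the same proportion as the vocabulary.<br />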
<br />
In this paper, Jean and his colleagues aim to solve this problem by proposing a training method based on importance sampling, which uses a large target vocabulary without increasing training complexity. The proposed algorithm achieves better translation performance without sacrificing speed. The algorithm is tested on two machine translation tasks (English <math>\rightarrow</math> German, and English <math>\rightarrow</math> French), and on the English <math>\rightarrow</math> French translation task it surpassed the best performance previously achieved by a single neural machine translation (NMT) system.<br />
<br />
==Methods==<br />
<br />
Recall that classic neural machine translation uses an encoder-decoder network. The encoder reads the source sentence x and encodes it into a sequence of hidden states h, where <math>h_t=f(x_t,h_{t-1})</math>. In the decoder step, another neural network generates the translation y based on the encoded sequence of hidden states h: <math>p(y_t\,|\,y_{<t},x)\propto \exp\{q(y_{t-1}, z_t, c_t)\}</math>, where <math>\, z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math>\, c_t=r(z_{t-1}, h_1,..., h_T)</math>.<br />
<br />
The objective function to be maximized is <br />
<math>\theta^*=\arg\max_\theta\sum_{n=1}^{N}\sum_{t=1}^{T_n}\log p(y_t^n\,|\,y_{<t}^n, x^n)</math><br />
<br />
where <math>(x^n, y^n)</math> is the n-th training pair of sentences, and <math>T_n</math> is the length of the n-th target sentence <math>y^n</math>.<br />
The proposed model is based on a specific implementation of neural machine translation that uses an attention mechanism, as recently proposed in <ref><br />
Bahdanau et al., [http://arxiv.org/pdf/1409.0473v6.pdf Neural Machine Translation by Jointly Learning to Align and Translate], 2014<br />
</ref>.<br />
In that model, the encoder is implemented by a bi-directional recurrent neural network, <math>h_t=[h_t^\leftarrow; h_t^\rightarrow]</math>. The decoder, at each time step, computes the context<br />
vector <math>c_t</math> as a convex sum of the hidden states <math>(h_1,...,h_T)</math> with the coefficients <math>(\alpha_1,...,\alpha_T)</math> computed by<br />
<br />
<math>\alpha_t=\frac{\exp\{a(h_t, z_{t-1})\}}{\sum_{k}\exp\{a(h_k, z_{t-1})\}}</math><br />
where a is a feedforward neural network with a single hidden layer. <br />
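<br />
The convex sum described above can be sketched in a few lines. The scores stand in for the outputs of the feedforward network a, which is not implemented here; the function name and the example numbers are hypothetical.<br />

```python
import math


def attention(scores, hidden_states):
    """Softmax the alignment scores and return (alphas, context vector).

    scores[k] stands in for the output of the alignment network a for
    encoder state k; it is supplied directly in this sketch.
    """
    m = max(scores)                                # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]             # non-negative and sum to one
    dim = len(hidden_states[0])
    context = [
        sum(alpha * h[i] for alpha, h in zip(alphas, hidden_states))
        for i in range(dim)
    ]
    return alphas, context


# Usage with three made-up 2-dimensional encoder states.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, context = attention([2.0, 0.5, -1.0], states)
print(alphas, context)
```

Because the coefficients are non-negative and sum to one, the context vector always lies inside the convex hull of the encoder states.<br />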
Then the probability of the next target word is <br />
<br />
<math>p(y_t\,|\,y_{<t}, x)=\frac{1}{Z} \exp\{w_t^\top\phi(y_{t-1}, z_t, c_t)+b_t\}</math>. Here <math>\phi</math> is an affine transformation followed by a nonlinear activation, and <math>w_t</math> and <math>b_t</math> are the target word vector and the target word bias, respectively. Z is the normalization constant computed by<br />
<br />
<br />
<math> Z=\sum_{k:y_k\in V}\exp\left(w_k^\top\phi(y_{t-1}, z_t, c_t)+b_k\right)</math> where V is the set of all the target words. <br />
<br />
<br />
The dot product between the feature <math>\phi(y_{t-1}, z_t, c_t)</math> and <math>w_k</math> has to be computed for every word in the target vocabulary, which is computationally expensive and time consuming. <br />
The approach of this paper uses only a subset of candidate target words, instead of all possible target words, to maximize Eq. (6) of the paper. The most naïve way to select such a subset is to keep only the K most frequent words. However, skipping words in this way defeats the purpose of using a large vocabulary, because it effectively removes many words from the target dictionary. Jean et al. instead propose using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, a target word set is constructed consisting of the K most frequent words (according to the estimated unigram probability) and, using the dictionary, at most <math>K'</math> likely target words for each source word. K and <math>K'</math> may be chosen either to meet the computational requirement or to maximize the translation performance on the development set. <br />
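<br />
The construction of the per-sentence candidate list can be sketched as below. The function name, the toy counts and the toy dictionary are hypothetical; the dictionary is assumed to map each source word to its target words sorted from most to least likely, which in the paper would come from the word alignment model.<br />

```python
from collections import Counter


def candidate_list(source_sentence, unigram_counts, dictionary, K, K_prime):
    """Union of the K most frequent target words and, for every source word,
    at most K_prime likely target words taken from the dictionary."""
    candidates = {word for word, _ in unigram_counts.most_common(K)}
    for source_word in source_sentence:
        candidates.update(dictionary.get(source_word, [])[:K_prime])
    return candidates


# Toy English -> French example (made-up data).
target_counts = Counter({"le": 5, "la": 4, "chat": 1})
alignment_dictionary = {
    "the": ["le", "la", "les"],
    "cat": ["chat", "minou", "matou"],
}

words = candidate_list(["the", "cat"], target_counts, alignment_dictionary, K=2, K_prime=2)
print(sorted(words))
```

Only the words in this set need a score during decoding, so rare but relevant words stay available without evaluating the whole vocabulary.<br />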
In order to avoid the growing complexity of computing the normalization constant, the authors propose to use only a small subset <math>V'</math> of the target vocabulary at each update<ref><br />
Bengio and Senécal, [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4443871.pdf Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model], IEEE Xplore, 2008<br />
</ref>. <br />
Consider the gradient of the log probability of the output word <math>y_t</math>. The gradient is composed of a positive and a negative part:<br />
<br />
<br />
<math>\nabla\log p(y_t\,|\,y_{<t}, x)=\nabla \varepsilon(y_t)-\sum_{k:y_k\in V} p(y_k\,|\,y_{<t}, x) \nabla \varepsilon(y_k) </math><br />
where the energy <math>\varepsilon</math> is defined as <math>\varepsilon(y_j)=w_j^\top\phi(y_{j-1}, z_j, c_j)+b_j</math>. The second term of the gradient is in essence the expected gradient of the energy, <math>\mathbb E_P[\nabla \varepsilon(y)]</math>, where P denotes <math>p(y\,|\,y_{<t}, x)</math>. <br />
The idea of the proposed approach is to approximate this expectation of the gradient by importance sampling with a small number of samples. Given a predefined proposal distribution Q and a set <math>V'</math> of samples from Q, the expectation is approximated by <br />
<br />
<math>\mathbb E_P[\nabla \varepsilon(y)]\approx \sum_{k:y_k\in V'} \frac{\omega_k}{\sum_{k':y_{k'}\in V'}\omega_{k'}}\nabla\varepsilon(y_k)</math> where <math>\,\omega_k=\exp\{\varepsilon(y_k)-\log Q(y_k)\}</math><br />
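<br />
The importance-sampling approximation of the expected gradient can be sketched as follows. This is a generic self-normalized estimator written for illustration, not the authors' implementation: energies, proposal log-probabilities and per-word gradients are passed in as plain lists, and all names and numbers are hypothetical.<br />

```python
import math


def importance_weights(energies, log_q):
    """Normalized weights proportional to exp(energy - log Q) over the sampled words."""
    raw = [e - lq for e, lq in zip(energies, log_q)]
    m = max(raw)                                   # subtract max for numerical stability
    unnormalized = [math.exp(r - m) for r in raw]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def approx_expected_gradient(energies, log_q, gradients):
    """Weighted average of the energy gradients of the sampled words."""
    weights = importance_weights(energies, log_q)
    dim = len(gradients[0])
    return [
        sum(w * g[i] for w, g in zip(weights, gradients))
        for i in range(dim)
    ]


# Usage: two sampled words, a uniform proposal and made-up 2-D energy gradients.
energies = [0.0, math.log(3.0)]
log_q = [math.log(0.5), math.log(0.5)]
gradients = [[1.0, 0.0], [0.0, 1.0]]
weights = importance_weights(energies, log_q)
estimate = approx_expected_gradient(energies, log_q, gradients)
print(weights, estimate)
```

When the proposal is uniform over the sampled subset, the <math>\log Q</math> terms cancel and the weights reduce to a softmax restricted to that subset, so the cost of an update depends on the subset size rather than on the full vocabulary size.<br />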
<br />
In practice, the training corpus is partitioned and a subset <math>V'</math> of the target vocabulary is defined for each partition prior to training. Before training begins, each target sentence in the training corpus is examined sequentially, and unique target words are accumulated until their number<br />
reaches the predefined threshold τ. The accumulated vocabulary is then used for this partition of the corpus during training. This process is repeated until the end of the training set is reached. <br />
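<br />
The partitioning step can be sketched as a single pass over the target sentences. The function name and the toy corpus are hypothetical, and one detail is an assumption: here a sentence is always added whole, so a partition's vocabulary may slightly exceed τ.<br />

```python
def partition_corpus(target_sentences, tau):
    """Split the corpus into consecutive partitions, closing a partition once its
    accumulated set of unique target words reaches the threshold tau."""
    partitions = []
    current_sentences, current_vocab = [], set()
    for sentence in target_sentences:
        current_sentences.append(sentence)
        current_vocab.update(sentence)
        if len(current_vocab) >= tau:
            partitions.append((current_sentences, current_vocab))
            current_sentences, current_vocab = [], set()
    if current_sentences:                          # leftover sentences form a last partition
        partitions.append((current_sentences, current_vocab))
    return partitions


# Usage on a toy corpus of tokenized target sentences.
corpus = [["a", "b"], ["b", "c"], ["d"], ["e", "f"]]
for sentences, vocab in partition_corpus(corpus, tau=3):
    print(len(sentences), "sentences, vocabulary:", sorted(vocab))
```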
<br />
This approach also yields the alignments between the target words and source locations via the alignment model. This is useful when the model generates an UNK token. Once a translation is generated for a source sentence, each UNK may be replaced using a translation-specific technique based on the aligned source word. In the experiments, the authors replaced each ''UNK'' token with the aligned source word or its most likely translation determined by another word alignment model.<br />
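<br />
The replacement of unknown tokens can be sketched as below. The function name and data are hypothetical, and the aligned source word for each target position is assumed to be given already (in the paper it comes from the attention-based alignment).<br />

```python
def replace_unknowns(target_tokens, aligned_source_words, dictionary, unk="UNK"):
    """Replace each UNK with the dictionary translation of its aligned source word,
    or with the source word itself when the dictionary has no entry."""
    output = []
    for token, source_word in zip(target_tokens, aligned_source_words):
        if token == unk:
            output.append(dictionary.get(source_word, source_word))
        else:
            output.append(token)
    return output


# Usage: the name is copied from the source, the common noun is translated.
translation = ["UNK", "aime", "le", "UNK"]
aligned = ["Garfield", "likes", "the", "cat"]
print(replace_unknowns(translation, aligned, {"cat": "chat"}))
```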
The proposed approach was evaluated on English→French and English→German translation. The neural machine translation models were trained on the bilingual, parallel corpora made available as part of WMT’14. The data sets used for English→French were Europarl v7, Common Crawl, UN, News Commentary and Gigaword. The data sets for English→German were Europarl v7, Common Crawl and News Commentary. <br />
<br />
The models were evaluated on the WMT’14 test set (news-test 2014), while the concatenation of news-test-2012 and news-test-2013 was used for model selection (development set). Table 1 of the paper presents data coverage w.r.t. the vocabulary size, on the target side.<br />
<br />
==Setting==<br />
<br />
As a baseline for English→French translation, the authors used the RNNsearch model proposed by (Bahdanau et al., 2014), with 30,000 source and target words; another RNNsearch model was trained for English→German translation with 50,000 source and target words. Using the proposed approach, another set of RNNsearch models with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes were used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. Performance on the development set was evaluated every twelve hours and the best result was reported. For both language pairs, new models were also trained with shortlist sizes of 15,000 and 50,000 by reshuffling the dataset at the beginning of each epoch. While this causes a non-negligible amount of overhead, such a change allows words to be contrasted with different sets of other words each epoch. Beam search was used to generate a translation given a source sentence. The authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences. The candidate list was chosen to maximize the performance on the development set, for K ∈ {15k, 30k, 50k} and K′ ∈ {10, 20}. They also test using a bilingual dictionary to accelerate decoding and to replace unknown words in translations.<br />
<br />
==Results==<br />
<br />
The results for English→French translation obtained by the models trained with very large target vocabularies are compared with the results of previous models in the table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Google<br />
! Phrase-based SMT (Cho et al.)<br />
! Phrase-based SMT (Durrani et al.)<br />
|-<br />
| BASIC NMT<br />
| 29.97 (26.58)<br />
| 32.68 (28.76)<br />
| 30.6<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Candidate List <br />
+ UNK Replace<br />
| 33.08 (29.08)<br />
| 33.36 (29.32)<br />
34.11 (29.98)<br />
| -<br />
33.1<br />
| 33.3<br />
| 37.03<br />
|- <br />
| + Reshuffle (tau=50)<br />
| -<br />
| 34.6 (30.53)<br />
| -<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Ensemble<br />
| -<br />
| 37.19 (31.98)<br />
| 37.5 <br />
| 33.3<br />
| 37.03<br />
|-<br />
|}<br />
<br />
<br />
The results for English→German translation are given in the table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Phrase-based SMT <br />
|-<br />
| BASIC NMT<br />
| 16.46 (17.13)<br />
| 16.95 (17.85)<br />
| 20.67<br />
|-<br />
| + Candidate List <br />
+ UNK Replace<br />
| 18.97 (19.16)<br />
| 17.46 (18.00)<br />
18.89 (19.03)<br />
| 20.67<br />
|- <br />
| + Reshuffle (tau=50)<br />
| -<br />
| 19.4<br />
| 20.67<br />
|-<br />
| + Ensemble<br />
| -<br />
| 21.59<br />
| 20.67 <br />
|-<br />
|}<br />
<br />
It is clear that RNNsearch-LV outperforms the baseline RNNsearch. On the English→French task, RNNsearch-LV approached the performance of the previous best single neural machine translation (NMT) model even without any translation-specific techniques; with these techniques, RNNsearch-LV outperformed it. The performance of RNNsearch-LV is also better than that of a standard phrase-based translation system.<br />
For English→German, RNNsearch-LV outperformed the baseline before unknown-word replacement, but after replacement the two systems performed similarly. The best single-model performance with a large vocabulary was achieved by reshuffling the dataset. In this case, the authors were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models. With τ = 15,000, the RNNsearch-LV performance worsened a little, with best BLEU scores, without reshuffling, of 33.76 and 18.59 for English→French and English→German, respectively.<br />
<br />
The decoding times for the different models are presented in the table below. While decoding from RNNsearch-LV with the full target vocabulary is slowest, the speed improves substantially if a candidate list is used for decoding each translation.<br />
{| class="wikitable"<br />
|-<br />
! Method <br />
! CPU i7-4820k<br />
! GPU GTX TITAN black<br />
|-<br />
| RNNsearch<br />
| 0.09 s<br />
| 0.02 s<br />
|-<br />
| RNNsearch-LV <br />
| 0.80 s<br />
| 0.25 s<br />
|-<br />
| RNNsearch-LV<br />
+Candidate list<br />
| 0.12 s<br />
| 0.05 s<br />
|}<br />
<br />
The influence of the target vocabulary was evaluated for English→French with a size of 30,000, by translating the test sentences using the union of a fixed set of 30,000 common words and (at most) K′ likely candidates for each source word. The performance of the system is comparable to the baseline when UNKs are not replaced, but there is not as much improvement when they are.<br />
The authors found that K is inversely correlated with t. <br />
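The candidate vocabulary described above (a fixed set of common words plus a few likely candidates per source word) can be sketched as follows. This is only an illustration: the function name, the toy dictionary, and the assumption that each dictionary entry is already sorted by likelihood are hypothetical and not taken from the paper.<br />

```python
def candidate_vocabulary(source_words, common_words, dictionary, k_prime):
    """Union of the common words and at most k_prime candidates per source word.

    `dictionary` is a hypothetical mapping from a source word to its target
    candidates, assumed to be sorted from most to least likely.
    """
    candidates = set(common_words)
    for word in source_words:
        candidates.update(dictionary.get(word, [])[:k_prime])
    return candidates


# Toy example with made-up entries.
common = {"le", "la", "de"}
toy_dictionary = {"cat": ["chat", "matou", "minet"], "sleeps": ["dort", "sommeille"]}
vocabulary = candidate_vocabulary(["cat", "sleeps"], common, toy_dictionary, k_prime=2)
```

Decoding then only needs to score the words in `vocabulary` rather than the full target vocabulary, which is where the speed-up in the timing table comes from.<br />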
<br />
<br />
==Conclusion==<br />
<br />
Using importance sampling, an approach was proposed for machine translation with a large target vocabulary without any substantial increase in computational complexity. The BLEU scores of the proposed model showed translation performance comparable to state-of-the-art translation systems on both the English→French and English→German tasks.<br />
On English→French and English→German translation tasks, the neural machine translation models trained using the proposed method performed as well as, or better than, those using only limited sets of target words, even when replacing unknown words.<br />
<br />
<br />
== Bibliography ==<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=very_Deep_Convoloutional_Networks_for_Large-Scale_Image_Recognition&diff=27334very Deep Convoloutional Networks for Large-Scale Image Recognition2015-12-16T18:49:18Z<p>Arashwan: /* Conv.Net Configurations */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper<ref><br />
Simonyan, Karen, and Andrew Zisserman. [http://arxiv.org/pdf/1409.1556.pdf "Very deep convolutional networks for large-scale image recognition."] arXiv preprint arXiv:1409.1556 (2014).</ref> the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting is investigated. It was demonstrated that the representation depth is beneficial for the<br />
classification accuracy and the main contribution is a thorough evaluation of networks of increasing depth using a certain architecture with very small (3×3) convolution filters. Basically, they fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers. As a result, they come up with significantly more accurate ConvNet architectures.<br />
<br />
= Conv.Net Configurations =<br />
<br />
Architecture:<br />
<br />
During training, the only preprocessing step is to subtract the mean RGB value computed on the training data. Then, the image is passed through a stack of convolutional (conv.) layers with filters with a very small receptive field: 3 × 3 with a convolutional stride of 1 pixel. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers. Max-pooling is performed over a 2 × 2 pixel window, with stride 2. A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers. The final layer is the soft-max layer and all hidden layers are equipped with the rectification non-linearity.<br />
<br />
They don't implement Local Response Normalization (LRN) as they found such normalization does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.<br />
<br />
Configuration:<br />
<br />
The ConvNet configurations, evaluated in this paper, are outlined in the following table:<br />
<br />
<br />
[[File:4.PNG | center]]<br />
<br />
<br />
All configurations follow the aforementioned architecture and differ only in the depth from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers) (the added layers are shown in bold). Besides, the width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.<br />
<br />
As shown in the table, multiple convolutional layers with small filters are used without any max-pooling layer between them. It is easy to show that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5, and a stack of three such layers has an effective receptive field of 7×7. Using a stack of two or three 3×3 conv. layers instead of a single layer with a larger filter has two main advantages:<br />
1) Two or three non-linear rectification layers are incorporated instead of a single one, which makes the decision function more discriminative.<br />
2) The number of parameters is decreased.<br />
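Both points about stacked 3×3 layers can be checked with a few lines of arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the channel count of 512 is chosen only for illustration.<br />

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Effective receptive field of stride-1 conv layers stacked without pooling."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels):
    """Weights of one kernel x kernel conv layer with `channels` inputs and outputs."""
    return kernel * kernel * channels * channels


channels = 512  # illustrative width
stack_of_three_3x3 = 3 * conv_weights(3, channels)  # 27 * channels^2
single_7x7 = conv_weights(7, channels)              # 49 * channels^2
```

Three stacked 3×3 layers see the same 7×7 window as one 7×7 layer but need 27C² rather than 49C² weights for C channels, while applying three rectifications instead of one.<br />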
<br />
In addition, the incorporation of 1×1 conv. layers (configuration C) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers. Although the 1×1 convolution is essentially a linear projection onto a space of the same dimensionality, additional non-linearity is introduced by the rectification function.<br />
<br />
= Classification Framework =<br />
<br />
In this section, the details of classification ConvNet training and evaluation are described.<br />
<br />
===Training===<br />
<br />
Training is carried out by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum. Initial weights for some layers were obtained from configuration “A” which is shallow enough to be trained with random initialization. The intermediate layers in deep models were initialized randomly.<br />
In spite of the larger number of parameters and the greater depth of the introduced nets, these nets required relatively fewer epochs to converge due to the following reasons:<br />
(a) implicit regularization imposed by greater depth and smaller conv. filter sizes.<br />
(b) using pre-initialization of certain layers.<br />
<br />
With respect to (b) above, the shallowest configuration (A in the previous table) was trained using random initialization. For all the other configurations, the first four convolutional layers and the last 3 fully connected layers were initialized with the corresponding parameters from A, to avoid getting stuck during training due to a bad initialization. All other layers were randomly initialized by sampling from a normal distribution with 0 mean.<br />
<br />
During training, the input to the ConvNets is a fixed-size 224 × 224 RGB image. To obtain it, a crop is taken at random from the rescaled training image (one crop per image per SGD iteration). The training scale, i.e. the size of the rescaled image from which the ConvNet input is cropped, must therefore be chosen.<br />
Two approaches for setting the training scale S (where S is the smallest side of an isotropically-rescaled training image) are considered:<br />
1) single-scale training, which uses a fixed S.<br />
2) multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax].<br />
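The multi-scale procedure can be sketched with plain size arithmetic, without loading any image. The helper names and the range [256, 512] used in the example are assumptions for illustration; the sketch also assumes Smin is at least 224 so that a crop always fits.<br />

```python
import random

CROP = 224  # fixed ConvNet input size


def sample_training_scale(s_min, s_max, rng):
    """Multi-scale training: draw S uniformly from [s_min, s_max]."""
    return rng.randint(s_min, s_max)


def rescaled_size(width, height, s):
    """Isotropic rescaling so that the smallest image side equals s."""
    scale = s / min(width, height)
    return max(s, round(width * scale)), max(s, round(height * scale))


def random_crop_box(width, height, rng, crop=CROP):
    """Pick one random crop (left, top, right, bottom) inside the rescaled image."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


rng = random.Random(0)
s = sample_training_scale(256, 512, rng)
new_width, new_height = rescaled_size(640, 480, s)
box = random_crop_box(new_width, new_height, rng)
```

Single-scale training corresponds to skipping `sample_training_scale` and passing the same fixed S for every image.<br />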
<br />
===Implementation===<br />
<br />
To improve the overall training speed of each model, the researchers parallelized the mini-batch gradient descent process. Since the model is very deep, training on a single GPU would take months to finish. To speed up the process, the researchers processed separate batches of images on each GPU in parallel to calculate the gradients. For example, with 4 GPUs, the model would take 4 batches of images, calculate their separate gradients, and then average the four sets of gradients to obtain the gradient used for the update. Krizhevsky (2014) introduced more complicated ways to parallelize the training of convolutional neural networks, but the researchers found that this simple configuration sped up training by a factor of 3.75 with 4 GPUs; with a possible maximum of 4, the simple configuration worked well enough.<br />
Finally, it took 2–3 weeks to train a single net by using four NVIDIA Titan Black GPUs.<br />
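The gradient averaging used for multi-GPU training can be illustrated on a toy problem. The sketch below replaces the ConvNet with a one-parameter linear model and the GPUs with list slices, so every name in it is an assumption made for illustration; it also assumes the batch divides evenly across devices.<br />

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of the model y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, batch, num_devices):
    """Split the batch into equal shards, compute one gradient per shard, average them."""
    shard_size = len(batch) // num_devices
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_devices)]
    per_device = [mse_gradient(w, shard) for shard in shards]  # done in parallel on real GPUs
    return sum(per_device) / num_devices


batch = [(float(x), 3.0 * x + 1.0) for x in range(8)]
full_gradient = mse_gradient(0.5, batch)
averaged_gradient = data_parallel_gradient(0.5, batch, num_devices=4)
```

With equal-sized shards the averaged gradient matches the gradient of the whole batch, so the parallel scheme gives the same update as training on a single device.<br />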
<br />
===Testing===<br />
<br />
At test time, in order to classify the input image:<br />
First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q. <br />
Then, the fully-connected layers are converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers), so that the network can be applied densely over the rescaled test image.<br />
Finally, the resulting fully-convolutional net is applied to the whole (uncropped) image.<br />
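To see why the converted network can handle any test size Q, it helps to track the spatial size of the class score map. The sketch below assumes a square Q × Q input and conv. layers padded so that only the five max-pooling layers change the resolution; the padding is an assumption not stated in this summary.<br />

```python
def class_score_map_side(q, num_pools=5, first_fc_kernel=7):
    """Side length of the class score map for a square q x q test image."""
    side = q
    for _ in range(num_pools):
        side //= 2  # each 2 x 2 max-pooling with stride 2 halves the resolution
    # The first FC layer, converted to a 7 x 7 convolution, is applied without padding.
    return side - first_fc_kernel + 1


side_at_training_size = class_score_map_side(224)
side_at_larger_size = class_score_map_side(384)
```

At the training size of 224 the map is a single position, which recovers the original fully-connected network; larger test images give a grid of class scores that has to be pooled into one prediction per image.<br />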
<br />
= Classification Experiments =<br />
In this section, the image classification results on the ILSVRC-2012 dataset are described:<br />
<br />
== Single-Scale Evaluation ==<br />
<br />
In the first part of the experiment, the test image size was set as Q = S for fixed S, and Q = 0.5(Smin + Smax) for jittered S ∈ [Smin, Smax]. One important result of this evaluation was that the classification error decreases with increased ConvNet depth.<br />
Moreover, the worse performance of the configuration with 1×1 filters (C) in comparison with the one with 3×3 filters (D) indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).<br />
Finally, scale jittering at training time leads to significantly better results than training on images with fixed smallest side. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.<br />
<br />
[[File:ConvNet1.PNG | center]]<br />
<br />
== Multi-Scale Evaluation ==<br />
<br />
In addition to the single-scale evaluation described in the previous section, the paper assesses the effect of scale jittering at test time by running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. The results indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale).<br />
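Averaging class posteriors over test scales amounts to a per-class mean. In the sketch below the three posterior vectors are made-up numbers standing in for the soft-max outputs of one model at three values of Q.<br />

```python
def average_posteriors(per_scale_posteriors):
    """Average the class posteriors obtained at several test scales."""
    num_scales = len(per_scale_posteriors)
    num_classes = len(per_scale_posteriors[0])
    return [sum(p[c] for p in per_scale_posteriors) / num_scales
            for c in range(num_classes)]


# Hypothetical soft-max outputs for one image at three scales.
posteriors = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.5, 0.2]]
averaged = average_posteriors(posteriors)
predicted_class = max(range(len(averaged)), key=averaged.__getitem__)
```

In this toy case the smallest scale alone would pick class 0, while the average over scales picks class 1.<br />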
<br />
Their best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error. On the test set, the configuration E achieves 7.3% top-5 error.<br />
<br />
[[File:ConvNet2.PNG | center]]<br />
<br />
== Comparison With The State Of The Art ==<br />
<br />
Their very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions.<br />
<br />
[[File:ConvNet3.PNG | center]]<br />
<br />
= Appendix A: Localization =<br />
<br />
In addition to classification, the introduced architectures have been used for localization. To perform object localisation, a very deep ConvNet is used in which the last fully connected layer predicts the bounding box location instead of the class scores. Apart from this last bounding box prediction layer, the ConvNet architecture D, which was found to be the best-performing in the classification task, is used, and training of localisation ConvNets is similar to that of the classification ConvNets. The main difference is that the logistic regression objective is replaced with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground truth.<br />
Two testing protocols are considered:<br />
The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class. (The bounding box is obtained by applying the network only to the central crop of the image.)<br />
The second, fully-fledged, testing procedure is based on the dense application of the localization ConvNet to the whole image, similarly to the classification task.<br />
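The Euclidean loss mentioned above can be written in a few lines. The parametrisation of a box as (centre x, centre y, width, height) and the use of the squared distance without a scaling factor are assumptions made for this sketch.<br />

```python
def euclidean_loss(predicted_box, true_box):
    """Squared Euclidean distance between predicted and ground-truth box parameters."""
    return sum((p - t) ** 2 for p, t in zip(predicted_box, true_box))


# Boxes as (centre_x, centre_y, width, height); the values are illustrative.
loss = euclidean_loss((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 2.0, 4.0))
```

Unlike the classification objective, this loss is zero only when all four predicted box parameters match the ground truth exactly.<br />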
<br />
The localization experiments indicate the performance advancement brought by the introduced very deep ConvNets: considerably better results are obtained with a simpler localization method but a more powerful representation.<br />
<br />
= Conclusion =<br />
<br />
Very deep ConvNets are introduced in this paper. The results show that the configurations perform well on classification and localization and significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. Details and more results on these competitions can be found here.<ref><br />
Russakovsky, Olga, et al. [http://arxiv.org/pdf/1409.0575v3.pdf "Imagenet large scale visual recognition challenge."] International Journal of Computer Vision (2014): 1-42.<br />
</ref> They also showed that their configuration is applicable to some other datasets.<br />
<br />
= Resources =<br />
<br />
The Oxford Visual Geometry Group (VGG) has released code for their 16-layer and 19-layer models. The code is available on their [http://www.robots.ox.ac.uk/~vgg/research/very_deep/ website] in the format used by the [http://caffe.berkeleyvision.org/ Caffe] toolbox and includes the weights of the pretrained networks.<br />
<br />
=References=<br />
<references /><br />
<br />
Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.</div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=very_Deep_Convoloutional_Networks_for_Large-Scale_Image_Recognition&diff=27333very Deep Convoloutional Networks for Large-Scale Image Recognition2015-12-16T18:41:26Z<p>Arashwan: /* Conv.Net Configurations */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper<ref><br />
Simonyan, Karen, and Andrew Zisserman. [http://arxiv.org/pdf/1409.1556.pdf "Very deep convolutional networks for large-scale image recognition."] arXiv preprint arXiv:1409.1556 (2014).</ref> the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting is investigated. It was demonstrated that the representation depth is beneficial for the<br />
classification accuracy and the main contribution is a thorough evaluation of networks of increasing depth using a certain architecture with very small (3×3) convolution filters. Basically, they fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers. As a result, they come up with significantly more accurate ConvNet architectures.<br />
<br />
= Conv.Net Configurations =<br />
<br />
Architecture:<br />
<br />
During training, the only preprocessing step is to subtract the mean RGB value computed on the training data. Then, the image is passed through a stack of convolutional (conv.) layers with filters with a very small receptive field: 3 × 3. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers. Max-pooling is performed over a 2 × 2 pixel window, with stride 2. A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers. The final layer is the soft-max layer and all hidden layers are equipped with the rectification non-linearity.<br />
<br />
They don't implement Local Response Normalization (LRN) as they found such normalization does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.<br />
<br />
Configuration:<br />
<br />
The ConvNet configurations, evaluated in this paper, are outlined in the following table:<br />
<br />
<br />
[[File:4.PNG | center]]<br />
<br />
<br />
All configurations follow the aforementioned architecture and differ only in the depth from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers) (the added layers are shown in bold). Besides, the width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.<br />
<br />
As shown in the table, multiple convolutional layers with small filters are used without any max-pooling layer between them. It is easy to show that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5, and a stack of three such layers has an effective receptive field of 7×7. Using a stack of two or three 3×3 conv. layers instead of a single layer with a larger filter has two main advantages:<br />
1) Two or three non-linear rectification layers are incorporated instead of a single one, which makes the decision function more discriminative.<br />
2) The number of parameters is decreased.<br />
<br />
In addition, the incorporation of 1×1 conv. layers (configuration C) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers. Although the 1×1 convolution is essentially a linear projection onto a space of the same dimensionality, additional non-linearity is introduced by the rectification function.<br />
<br />
= Classification Framework =<br />
<br />
In this section, the details of classification ConvNet training and evaluation are described.<br />
<br />
===Training===<br />
<br />
Training is carried out by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum. Initial weights for some layers were obtained from configuration “A” which is shallow enough to be trained with random initialization. The intermediate layers in deep models were initialized randomly.<br />
In spite of the larger number of parameters and the greater depth of the introduced nets, these nets required relatively fewer epochs to converge due to the following reasons:<br />
(a) implicit regularization imposed by greater depth and smaller conv. filter sizes.<br />
(b) using pre-initialization of certain layers.<br />
<br />
With respect to (b) above, the shallowest configuration (A in the previous table) was trained using random initialization. For all the other configurations, the first four convolutional layers and the last 3 fully connected layers were initialized with the corresponding parameters from A, to avoid getting stuck during training due to a bad initialization. All other layers were randomly initialized by sampling from a normal distribution with 0 mean.<br />
<br />
During training, the input to the ConvNets is a fixed-size 224 × 224 RGB image. To obtain it, a crop is taken at random from the rescaled training image (one crop per image per SGD iteration). The training scale, i.e. the size of the rescaled image from which the ConvNet input is cropped, must therefore be chosen.<br />
Two approaches for setting the training scale S (where S is the smallest side of an isotropically-rescaled training image) are considered:<br />
1) single-scale training, which uses a fixed S.<br />
2) multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax].<br />
<br />
===Implementation===<br />
<br />
To improve the overall training speed of each model, the researchers parallelized the mini-batch gradient descent process. Since the model is very deep, training on a single GPU would take months to finish. To speed up the process, the researchers processed separate batches of images on each GPU in parallel to calculate the gradients. For example, with 4 GPUs, the model would take 4 batches of images, calculate their separate gradients, and then average the four sets of gradients to obtain the gradient used for the update. Krizhevsky (2014) introduced more complicated ways to parallelize the training of convolutional neural networks, but the researchers found that this simple configuration sped up training by a factor of 3.75 with 4 GPUs; with a possible maximum of 4, the simple configuration worked well enough.<br />
Finally, it took 2–3 weeks to train a single net by using four NVIDIA Titan Black GPUs.<br />
<br />
===Testing===<br />
<br />
At test time, in order to classify the input image:<br />
First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q. <br />
Then, the fully-connected layers are converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers), so that the network can be applied densely over the rescaled test image.<br />
Finally, the resulting fully-convolutional net is applied to the whole (uncropped) image.<br />
<br />
= Classification Experiments =<br />
In this section, the image classification results on the ILSVRC-2012 dataset are described:<br />
<br />
== Single-Scale Evaluation ==<br />
<br />
In the first part of the experiment, the test image size was set as Q = S for fixed S, and Q = 0.5(Smin + Smax) for jittered S ∈ [Smin, Smax]. One important result of this evaluation was that the classification error decreases with increased ConvNet depth.<br />
Moreover, the worse performance of the configuration with 1×1 filters (C) in comparison with the one with 3×3 filters (D) indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).<br />
Finally, scale jittering at training time leads to significantly better results than training on images with fixed smallest side. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.<br />
<br />
[[File:ConvNet1.PNG | center]]<br />
<br />
== Multi-Scale Evaluation ==<br />
<br />
In addition to the single-scale evaluation described in the previous section, the paper assesses the effect of scale jittering at test time by running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. The results indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale).<br />
<br />
Their best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error. On the test set, the configuration E achieves 7.3% top-5 error.<br />
<br />
[[File:ConvNet2.PNG | center]]<br />
<br />
== Comparison With The State Of The Art ==<br />
<br />
Their very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions.<br />
<br />
[[File:ConvNet3.PNG | center]]<br />
<br />
= Appendix A: Localization =<br />
<br />
In addition to classification, the introduced architectures have been used for localization. To perform object localisation, a very deep ConvNet is used in which the last fully connected layer predicts the bounding box location instead of the class scores. Apart from this last bounding box prediction layer, the ConvNet architecture D, which was found to be the best-performing in the classification task, is used, and training of localisation ConvNets is similar to that of the classification ConvNets. The main difference is that the logistic regression objective is replaced with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground truth.<br />
Two testing protocols are considered:<br />
The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class. (The bounding box is obtained by applying the network only to the central crop of the image.)<br />
The second, fully-fledged, testing procedure is based on the dense application of the localization ConvNet to the whole image, similarly to the classification task.<br />
<br />
The localization experiments indicate the performance advancement brought by the introduced very deep ConvNets: considerably better results are obtained with a simpler localization method but a more powerful representation.<br />
<br />
= Conclusion =<br />
<br />
Very deep ConvNets are introduced in this paper. The results show that the configurations perform well on classification and localization and significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. Details and more results on these competitions can be found here.<ref><br />
Russakovsky, Olga, et al. [http://arxiv.org/pdf/1409.0575v3.pdf "Imagenet large scale visual recognition challenge."] International Journal of Computer Vision (2014): 1-42.<br />
</ref> They also showed that their configuration is applicable to some other datasets.<br />
<br />
= Resources =<br />
<br />
The Oxford Visual Geometry Group (VGG) has released code for their 16-layer and 19-layer models. The code is available on their [http://www.robots.ox.ac.uk/~vgg/research/very_deep/ website] in the format used by the [http://caffe.berkeleyvision.org/ Caffe] toolbox and includes the weights of the pretrained networks.<br />
<br />
=References=<br />
<references /><br />
<br />
Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.</div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=extracting_and_Composing_Robust_Features_with_Denoising_Autoencoders&diff=27332extracting and Composing Robust Features with Denoising Autoencoders2015-12-16T17:28:50Z<p>Arashwan: /* The Denoising Autoencoder */</p>
<hr />
<div>= Introduction =<br />
This paper explores a new training principle for unsupervised learning<br />
of a representation based on the idea of making the learned representations<br />
robust to partial corruption of the input pattern. This approach can<br />
be used to train autoencoders, and these denoising autoencoders can be<br />
stacked to initialize deep architectures. The algorithm can be motivated<br />
from a manifold learning and information theoretic perspective or from a<br />
generative model perspective.<br />
The proposed system is similar to a standard auto-encoder, which is trained with the objective function to learn a hidden representation which allows it to reconstruct its input. The difference between these two models is that the model is trained to reconstruct the original input from a corrupted version, generated by adding random noise to the data. This will result in extracting useful features.<br />
== Motivation ==<br />
<br />
The approach is based on the use of an unsupervised<br />
training criterion to perform a layer-by-layer initialization. The procedure is as follows:<br />
Each layer is at first trained to produce a higher level (hidden) representation of the observed patterns,<br />
based on the representation it receives as input from the layer below, by<br />
optimizing a local unsupervised criterion. Each level produces a representation<br />
of the input pattern that is more abstract than the previous level’s, because it<br />
is obtained by composing more operations. This initialization yields a starting<br />
point, from which a global fine-tuning of the model’s parameters is then performed<br />
using another training criterion appropriate for the task at hand.<br />
<br />
This process gives better solutions than those obtained by random initialization.<br />
<br />
= The Denoising Autoencoder =<br />
<br />
A Denoising Autoencoder reconstructs<br />
a clean “repaired” input from a corrupted, partially destroyed one. This<br />
is done by first corrupting the initial input <math>x</math> to get a partially destroyed version<br />
<math>\tilde{x}</math> by means of a stochastic mapping. This means<br />
that the autoencoder must learn to compute a representation<br />
that is informative of the original input even<br />
when some of its elements are missing. This technique<br />
was inspired by the ability of humans to have an appropriate<br />
understanding of their environment even in<br />
situations where the available information is incomplete<br />
(e.g. when looking at an object that is partly<br />
occluded). In this paper the noise is added by randomly zeroing a fixed number, <math>\nu d</math>, of the <math>d</math> input components (where <math>\nu</math> is the desired proportion of destruction) and leaving the rest untouched. This is similar to salt noise in images, where random white background areas appear in an image.<br />
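The corruption step can be sketched as follows. The function name and the parametrisation of the number of zeroed components as a fraction `nu` of the input dimension are choices made for this illustration.<br />

```python
import random


def corrupt(x, nu, rng):
    """Zero a fixed number of randomly chosen components of x, leaving the rest untouched."""
    d = len(x)
    num_destroyed = int(nu * d)  # fixed count, not an independent coin flip per component
    destroyed = set(rng.sample(range(d), num_destroyed))
    return [0.0 if i in destroyed else value for i, value in enumerate(x)]


x = [1.0] * 20
x_tilde = corrupt(x, nu=0.25, rng=random.Random(0))
```

The autoencoder is then trained to reconstruct the clean `x` from `x_tilde`, so the reconstruction target is always the uncorrupted input.<br />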
<br />
As shown in the figure below, the clean input <math>x</math> is mapped to a corrupted version <math>\tilde{x}</math> according to some conditional distribution <math>q_D(\tilde{x}|x)</math>. The corrupted version is then mapped to some informative domain <math>y</math>, and the autoencoder then attempts to reconstruct the clean version <math>x</math> from <math>y</math>. Thus the objective function can be described as<br />
[[File:W1.png]]<br />
<br />
The objective function minimized by<br />
stochastic gradient descent becomes: <br />
[[File:W2.png]]<br />
<br />
where the loss function is the cross-entropy of the model.<br />
The denoising autoencoder is illustrated in the figure below:<br />
<br />
[[File:W3.png]]<br />
<br />
It is important to note that usually the dimensionality of the hidden layer needs to be less than the input/output layer in order to avoid the trivial solution of identity mapping, but in this case that is not a problem since randomly zeroing out numbers causes the identity map to fail. This forces the network to learn a more abstract representation of the data regardless of the relative sizes of the layers.<br />
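<br />
To make the training criterion concrete, the following sketch computes the loss for one example. It assumes sigmoid encoder and decoder layers and a cross-entropy reconstruction loss; the names <code>dae_loss</code>, <code>W_prime</code> and <code>b_prime</code> are assumptions chosen for illustration, since the equations in this section are only available as images.<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def reconstruction_cross_entropy(x, z, eps=1e-12):
    """Cross-entropy between a target x in [0, 1]^d and a reconstruction z."""
    z = np.clip(z, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z))

def dae_loss(x, x_tilde, W, b, W_prime, b_prime):
    y = sigmoid(W @ x_tilde + b)        # hidden code computed from the CORRUPTED input
    z = sigmoid(W_prime @ y + b_prime)  # reconstruction
    return reconstruction_cross_entropy(x, z)  # compared against the CLEAN input

rng = np.random.default_rng(0)
d, d_hidden = 8, 12  # hidden layer larger than the input, as discussed above
x = rng.integers(0, 2, size=d).astype(float)
x_tilde = x.copy()
x_tilde[:2] = 0.0  # toy corruption: zero the first two components
W = rng.normal(scale=0.1, size=(d_hidden, d))
b = np.zeros(d_hidden)
W_prime = rng.normal(scale=0.1, size=(d, d_hidden))
b_prime = np.zeros(d)
loss = dae_loss(x, x_tilde, W, b, W_prime, b_prime)
```

The key point is that the encoder only ever sees the corrupted input while the loss targets the clean one, so copying the input through does not minimize the loss even when the hidden layer is overcomplete.<br />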
<br />
= Layer-wise Initialization and Fine Tuning =<br />
<br />
Denoising autoencoders are stacked greedily: the representation of the k-th layer is used as<br />
input for the (k + 1)-th, and the (k + 1)-th layer is trained after the k-th has been<br />
trained. After a few layers have been trained, their parameters are used as initialization<br />
for a network optimized with respect to a supervised training criterion.<br />
This greedy layer-wise procedure has been shown to yield significantly better<br />
local minima than random initialization of deep networks,<br />
achieving better generalization on a number of tasks.<br />
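<br />
The greedy procedure can be sketched as follows. This is a minimal sketch under several assumptions that are not taken from the source: tied weights (the decoder uses the transpose of the encoder matrix), plain per-example stochastic gradient descent, and hypothetical function names and hyperparameter values. Each layer is trained on corrupted versions of its input, and the next layer receives the uncorrupted output of the trained layer.<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dae_layer(data, n_hidden, destruction_fraction, rng, epochs=20, lr=0.1):
    """Train one denoising autoencoder layer with tied weights (an assumption)."""
    n, d = data.shape
    W = rng.normal(scale=0.1, size=(n_hidden, d))
    b = np.zeros(n_hidden)
    b_prime = np.zeros(d)
    n_destroy = int(round(destruction_fraction * d))
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = data[i]
            x_tilde = x.copy()
            x_tilde[rng.choice(d, size=n_destroy, replace=False)] = 0.0
            y = sigmoid(W @ x_tilde + b)      # encode the corrupted input
            z = sigmoid(W.T @ y + b_prime)    # decode with the transposed weights
            delta_z = z - x                   # cross-entropy gradient at the output pre-activation
            delta_y = (W @ delta_z) * y * (1.0 - y)  # backpropagated to the hidden pre-activation
            W -= lr * (np.outer(delta_y, x_tilde) + np.outer(y, delta_z))
            b -= lr * delta_y
            b_prime -= lr * delta_z
    return W, b

def pretrain_stack(data, layer_sizes, destruction_fraction, rng):
    """Greedy layer-wise pretraining: layer k+1 is trained after layer k."""
    params, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b = train_dae_layer(layer_input, n_hidden, destruction_fraction, rng)
        params.append((W, b))
        # The next layer is fed the representation of the UNCORRUPTED input.
        layer_input = sigmoid(layer_input @ W.T + b)
    return params

rng = np.random.default_rng(0)
toy_data = rng.integers(0, 2, size=(50, 16)).astype(float)  # toy binary data
params = pretrain_stack(toy_data, [12, 8, 4], 0.25, rng)
```

The returned list of encoder parameters would then initialize the hidden layers of a classifier, which is fine-tuned with a supervised criterion; that fine-tuning step is not shown here.<br />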
<br />
= Analysis of the Denoising Autoencoder =<br />
== Manifold Learning Perspective ==<br />
<br />
<br />
The process of mapping a corrupted example to an uncorrupted one can be<br />
visualized in Figure 2, with a low-dimensional manifold <math>\mathcal{M}</math> near which the data<br />
concentrate. We learn a stochastic operator <math>p(X|\tilde{X})</math> that maps an <math>\tilde{X}</math> to an <math>X\,</math>.<br />
<br />
<br />
[[File:q4.png]]<br />
<br />
Since the corrupted points <math>\tilde{X}</math> will generally lie farther from <math>\mathcal{M}</math> than the clean data, the operator <math>p(X|\tilde{X})</math> learns to map points away from <math>\mathcal{M}</math> back to points on or near <math>\mathcal{M}</math>.<br />
<br />
The denoising autoencoder can thus be seen as a way to define and learn a<br />
manifold. The intermediate representation <math>Y = f(X)</math> can be interpreted as a<br />
coordinate system for points on the manifold (this is most clear if we force the<br />
dimension of <math>Y</math> to be smaller than the dimension of <math>X</math>). More generally, one can<br />
think of <math>Y = f(X)</math> as a representation of <math>X</math> which is well suited to capture the<br />
main variations in the data, i.e., on the manifold. When additional criteria (such<br />
as sparsity) are introduced in the learning model, one can no longer directly view<br />
<math>Y = f(X)</math> as an explicit low-dimensional coordinate system for points on the<br />
manifold, but it retains the property of capturing the main factors of variation<br />
in the data.<br />
<br />
== Stochastic Operator Perspective ==<br />
<br />
The denoising autoencoder can also be seen as corresponding to a semi-parametric model that can be sampled from. Define the joint distribution as follows: <br />
<br />
:<math>p(X, \tilde{X}) = p(\tilde{X}) p(X|\tilde{X}) = q^0(\tilde{X}) p(X|\tilde{X}) </math> <br />
<br />
from the stochastic operator <math>p(X | \tilde{X})</math>, with <math>q^0\,</math> being the empirical distribution.<br />
<br />
Using the Kullback-Leibler divergence, defined as:<br />
<br />
:<math>\mathbb{D}_{KL}(p\|q) = \mathbb{E}_{p(X)} \left(\log\frac{p(X)}{q(X)}\right) </math><br />
<br />
then minimizing <math>\mathbb{D}_{KL}(q^0(X, \tilde{X}) \| p(X, \tilde{X})) </math> yields the originally-formulated denoising criterion. Furthermore, as this objective is minimized, the marginals of <math>\,p</math> approach those of <math>\,q^0</math>, i.e. <math> p(X) \rightarrow q^0(X)</math>. Then, if <math>\,p</math> is expanded in the following way:<br />
<br />
:<math> p(X) = \frac{1}{n}\sum_{i=1}^n \sum_{\tilde{\mathbf{x}}} p(X|\tilde{X} = \tilde{\mathbf{x}}) q_{\mathcal{D}}(\tilde{\mathbf{x}} | \mathbf{x}_i) </math><br />
<br />
it becomes clear that the denoising autoencoder learns a semi-parametric model that can be sampled from (since <math>p(X)</math> above is easy to sample from). <br />
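<br />
The expansion of <math>p(X)</math> above suggests a simple sampling recipe: pick a training example uniformly, corrupt it, then sample from the learned conditional. The sketch below assumes a Bernoulli output model whose means are given by a trained reconstruction function; <code>sample_from_dae</code>, <code>reconstruct</code> and <code>toy_reconstruct</code> are hypothetical names, and the toy reconstruction function merely stands in for a trained network.<br />

```python
import numpy as np

def sample_from_dae(train_data, reconstruct, destruction_fraction, rng):
    """Draw one sample from p(X) by ancestral sampling."""
    n, d = train_data.shape
    x_i = train_data[rng.integers(n)]  # uniform choice of i (the 1/n sum)
    x_tilde = x_i.copy()
    n_destroy = int(round(destruction_fraction * d))
    x_tilde[rng.choice(d, size=n_destroy, replace=False)] = 0.0  # draw from q_D(x_tilde | x_i)
    z = reconstruct(x_tilde)  # Bernoulli means of p(X | x_tilde), assumed in [0, 1]
    return (rng.random(d) < z).astype(float)  # draw from p(X | x_tilde)

rng = np.random.default_rng(0)
train_data = rng.integers(0, 2, size=(20, 10)).astype(float)
# Stand-in for a trained denoising autoencoder (an assumption for the demo).
toy_reconstruct = lambda x_tilde: np.clip(x_tilde, 0.05, 0.95)
sample = sample_from_dae(train_data, toy_reconstruct, 0.2, rng)
```

The model is semi-parametric in the sense that sampling requires both the learned parameters (inside <code>reconstruct</code>) and the training set itself.<br />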
<br />
== Information Theoretic Perspective ==<br />
<br />
It is also possible to adopt an information theoretic perspective. The representation of the autoencoder should retain as much information as possible while at the same time certain properties, like a limited complexity, are imposed on the marginal distribution. This can be expressed as an optimization of <math>\arg\max_{\theta} \{I(X;Y) + \lambda \mathcal{J}(Y)\}</math> where <math>I(X; Y)</math> is the mutual information between an input sample <math>X</math> and the hidden representation <math>Y</math> and <math>\mathcal{J}</math> is a functional expressing the preference over the marginal. The hyper-parameter <math>\lambda</math> controls the trade-off between maximizing the mutual information and keeping the marginal simple.<br />
<br />
Note that this reasoning also applies to the basic autoencoder; the difference is that the denoising autoencoder maximizes the mutual information between <math>X</math> and <math>Y</math> even though <math>Y</math> is computed from a corrupted input.<br />
<br />
== Generative Model Perspective ==<br />
<br />
This perspective recovers the training criterion of the denoising autoencoder from a generative model. The criterion from the information theoretic perspective is equivalent to maximizing a variational bound on a particular generative model. The final training criterion is to maximize <math> \mathbb{E}_{q^0(\tilde{X})}[L(q^0, \tilde{X})] </math>, where <math> L(q^0, \tilde{X}) = \mathbb{E}_{q^0(X,Y | \tilde{X})}\left[\log\frac{p(X, \tilde{X}, Y)}{q^0(X, Y | \tilde{X})}\right] </math><br />
<br />
= Experiments =<br />
The benchmark consists of different<br />
variations of the MNIST digit classification problem, with added factors of<br />
variation such as rotation (rot), addition of a background composed of random<br />
pixels (bg-rand) or made from patches extracted from a set of images (bg-img), or<br />
combinations of these factors (rot-bg-img). These variations render the problems particularly challenging for current generic learning algorithms. Each problem<br />
is divided into a training, validation, and test set (10000, 2000, 50000 examples<br />
respectively). A subset of the original MNIST problem is also included with the<br />
same example set sizes (problem basic). The benchmark also contains additional<br />
binary classification problems: discriminating between convex and non-convex<br />
shapes (convex), and between wide and long rectangles (rect, rect-img).<br />
Neural networks with 3 hidden layers initialized by stacking denoising autoencoders<br />
(SdA-3), and fine tuned on the classification tasks, were evaluated<br />
on all the problems in this benchmark. Model selection was conducted following<br />
a similar procedure as Larochelle et al. (2007). Several values of the<br />
hyperparameters (destruction fraction ν, layer sizes, number of unsupervised training<br />
epochs) were tried, combined with early stopping in the fine tuning phase. For<br />
each task, the best model was selected based on its classification performance<br />
on the validation set.<br />
The results are reported in the following table.<br />
[[File:W5.png]]<br />
<br />
The filters obtained by training are shown in the figure below.<br />
<br />
<br />
[[File:Qq3.png]]<br />
<br />
= Conclusion and Future Work =<br />
<br />
The paper introduces the denoising autoencoder, motivated by the goal of<br />
learning representations of the input that are robust to small irrelevant changes<br />
in the input. The approach can also be motivated from a manifold learning<br />
perspective, an information theoretic perspective, and a generative model perspective.<br />
This principle can be used to train and stack autoencoders to initialize a<br />
deep neural network. A series of image classification experiments were performed<br />
to evaluate this new training principle. The empirical results support<br />
the following conclusions: unsupervised initialization of layers with an explicit<br />
denoising criterion helps to capture interesting structure in the input distribution.<br />
This in turn leads to intermediate representations much better suited for<br />
subsequent learning tasks such as supervised classification. The experimental<br />
results with Deep Belief Networks (whose layers are initialized as RBMs) suggest<br />
that RBMs may also encapsulate a form of robustness in the representations<br />
they learn, possibly because of their stochastic nature, which introduces noise<br />
in the representation during training.<br />
<br />
= References =<br />
<br />
Bengio, Y. (2007). Learning deep architectures for AI (Technical Report 1312).<br />
Université de Montréal, dept. IRO.<br />
<br />
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layerwise<br />
training of deep networks. Advances in Neural Information Processing<br />
Systems 19 (pp. 153–160). MIT Press.<br />
<br />
Bengio, Y., & Le Cun, Y. (2007). Scaling learning algorithms towards AI. In<br />
L. Bottou, O. Chapelle, D. DeCoste and J. Weston (Eds.), Large scale kernel<br />
machines. MIT Press.<br />
<br />
Doi, E., Balcan, D. C., & Lewicki, M. S. (2006). A theoretical analysis of<br />
robust coding over noisy overcomplete channels. In Y. Weiss, B. Schölkopf<br />
and J. Platt (Eds.), Advances in neural information processing systems 18,<br />
307–314. Cambridge, MA: MIT Press.<br />
<br />
Doi, E., & Lewicki, M. S. (2007). A theory of retinal population coding. NIPS<br />
(pp. 353–360). MIT Press.<br />
<br />
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant<br />
representations over learned dictionaries. IEEE Transactions on Image Processing,<br />
15, 3736–3745.<br />
<br />
Gallinari, P., LeCun, Y., Thiria, S., & Fogelman-Soulié, F. (1987). Mémoires<br />
associatives distribuées. Proceedings of COGNITIVA 87. Paris, La Villette.<br />
<br />
Hammond, D., & Simoncelli, E. (2007). A machine learning framework for adaptive<br />
combination of signal denoising methods. 2007 International Conference<br />
on Image Processing (pp. VI: 29–32).<br />
<br />
Hinton, G. (1989). Connectionist learning procedures. Artificial Intelligence,<br />
40, 185–234.<br />
<br />
Hinton, G., & Salakhutdinov, R. (2006). Reducing the dimensionality of data<br />
with neural networks. Science, 313, 504–507.<br />
<br />
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for<br />
deep belief nets. Neural Computation, 18, 1527–1554.<br />
<br />
Hopfield, J. (1982). Neural networks and physical systems with emergent collective<br />
computational abilities. Proceedings of the National Academy of Sciences,<br />
USA, 79.<br />
<br />
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007).<br />
An empirical evaluation of deep architectures on problems with many factors<br />
of variation. Twenty-fourth International Conference on Machine Learning<br />
(ICML’2007).<br />
<br />
LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Doctoral dissertation,<br />
Université de Paris VI.<br />
<br />
Lee, H., Ekanadham, C., & Ng, A. (2008). Sparse deep belief net model for visual<br />
area V2. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in<br />
neural information processing systems 20. Cambridge, MA: MIT Press.<br />
<br />
McClelland, J., Rumelhart, D., & the PDP Research Group (1986). Parallel<br />
distributed processing: Explorations in the microstructure of cognition, vol. 2.<br />
Cambridge: MIT Press.<br />
<br />
Memisevic, R. (2007). Non-linear latent factor models for revealing structure<br />
in high-dimensional data. Doctoral dissertation, Department of Computer<br />
Science, University of Toronto, Toronto, Ontario, Canada.<br />
<br />
Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2008). Sparse feature learning for<br />
deep belief networks. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.),<br />
Advances in neural information processing systems 20. Cambridge, MA: MIT<br />
Press.<br />
<br />
Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning<br />
of sparse representations with an energy-based model. Advances in Neural<br />
Information Processing Systems (NIPS 2006). MIT Press.<br />
<br />
Roth, S., & Black, M. (2005). Fields of experts: a framework for learning image<br />
priors. IEEE Conference on Computer Vision and Pattern Recognition (pp.<br />
860–867).</div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=extracting_and_Composing_Robust_Features_with_Denoising_Autoencoders&diff=27331extracting and Composing Robust Features with Denoising Autoencoders2015-12-16T17:18:58Z<p>Arashwan: /* The Denoising Autoencoder */</p>
<hr />
<div>= Introduction =<br />
This paper explores a new training principle for unsupervised learning<br />
of a representation based on the idea of making the learned representations<br />
robust to partial corruption of the input pattern. This approach can<br />
be used to train autoencoders, and these denoising autoencoders can be<br />
stacked to initialize deep architectures. The algorithm can be motivated<br />
from a manifold learning and information theoretic perspective or from a<br />
generative model perspective.<br />
The proposed model is similar to a standard autoencoder, which is trained to learn a hidden representation from which its input can be reconstructed. The difference is that the denoising autoencoder is trained to reconstruct the original input from a corrupted version of it, obtained by applying random noise to the data. This forces the model to extract more robust, useful features.<br />
== Motivation ==<br />
<br />
The approach is based on the use of an unsupervised<br />
training criterion to perform a layer-by-layer initialization. The procedure is as follows :<br />
Each layer is at first trained to produce a higher level (hidden) representation of the observed patterns,<br />
based on the representation it receives as input from the layer below, by<br />
optimizing a local unsupervised criterion. Each level produces a representation<br />
of the input pattern that is more abstract than the previous level’s, because it<br />
is obtained by composing more operations. This initialization yields a starting<br />
point, from which a global fine-tuning of the model’s parameters is then performed<br />
using another training criterion appropriate for the task at hand.<br />
<br />
This process gives better solutions than those obtained with random initialization.<br />
<br />
= The Denoising Autoencoder =<br />
<br />
A Denoising Autoencoder reconstructs<br />
a clean “repaired” input from a corrupted, partially destroyed one. This<br />
is done by first corrupting the initial input <math>x</math> to get a partially destroyed version<br />
<math>\tilde{x}</math> by means of a stochastic mapping. This means<br />
that the autoencoder must learn to compute a representation<br />
that is informative of the original input even<br />
when some of its elements are missing. This technique<br />
was inspired by the ability of humans to have an appropriate<br />
understanding of their environment even in<br />
situations where the available information is incomplete<br />
(e.g. when looking at an object that is partly<br />
occluded). In this paper, the input is corrupted by choosing a fixed number <math>\nu d</math> of the <math>d</math> input components at random and setting them to zero, leaving the rest untouched (<math>\nu</math> is the destruction fraction). For images, this resembles the "pepper" part of salt-and-pepper noise: randomly placed pixels are blanked out.<br />
<br />
Thus the objective function can be described as<br />
[[File:W1.png]]<br />
<br />
The objective function minimized by<br />
stochastic gradient descent becomes: <br />
[[File:W2.png]]<br />
<br />
where the loss function is the reconstruction cross-entropy.<br />
The denoising autoencoder is illustrated in the figure below:<br />
<br />
[[File:W3.png]]<br />
<br />
It is important to note that usually the dimensionality of the hidden layer needs to be less than the input/output layer in order to avoid the trivial solution of identity mapping, but in this case that is not a problem since randomly zeroing out numbers causes the identity map to fail. This forces the network to learn a more abstract representation of the data regardless of the relative sizes of the layers.<br />
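<br />
For readers without access to the equation images, the reconstruction cross-entropy mentioned above is commonly written as follows. This is a sketch of the standard form, assuming sigmoid units <math>s</math>, inputs in <math>[0,1]^d</math>, and the symbols <math>z</math>, <math>W'</math>, <math>b'</math> for the reconstruction and the decoder parameters; these symbol names are assumptions, since the images are not reproduced here.<br />

```latex
\begin{align}
y &= s(W\tilde{x} + b), \qquad z = s(W'y + b') \\
L_{H}(x, z) &= -\sum_{k=1}^{d}\left[x_k \log z_k + (1 - x_k)\log(1 - z_k)\right]
\end{align}
```

Note that <math>y</math> is computed from the corrupted input <math>\tilde{x}</math>, while the loss compares the reconstruction <math>z</math> with the clean input <math>x</math>.<br />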
<br />
= Layer-wise Initialization and Fine Tuning =<br />
<br />
Denoising autoencoders are stacked greedily: the representation of the k-th layer is used as<br />
input for the (k + 1)-th, and the (k + 1)-th layer is trained after the k-th has been<br />
trained. After a few layers have been trained, their parameters are used as initialization<br />
for a network optimized with respect to a supervised training criterion.<br />
This greedy layer-wise procedure has been shown to yield significantly better<br />
local minima than random initialization of deep networks,<br />
achieving better generalization on a number of tasks.<br />
<br />
= Analysis of the Denoising Autoencoder =<br />
== Manifold Learning Perspective ==<br />
<br />
<br />
The process of mapping a corrupted example to an uncorrupted one can be<br />
visualized in Figure 2, with a low-dimensional manifold <math>\mathcal{M}</math> near which the data<br />
concentrate. We learn a stochastic operator <math>p(X|\tilde{X})</math> that maps an <math>\tilde{X}</math> to an <math>X\,</math>.<br />
<br />
<br />
[[File:q4.png]]<br />
<br />
Since the corrupted points <math>\tilde{X}</math> will generally lie farther from <math>\mathcal{M}</math> than the clean data, the operator <math>p(X|\tilde{X})</math> learns to map points away from <math>\mathcal{M}</math> back to points on or near <math>\mathcal{M}</math>.<br />
<br />
The denoising autoencoder can thus be seen as a way to define and learn a<br />
manifold. The intermediate representation <math>Y = f(X)</math> can be interpreted as a<br />
coordinate system for points on the manifold (this is most clear if we force the<br />
dimension of <math>Y</math> to be smaller than the dimension of <math>X</math>). More generally, one can<br />
think of <math>Y = f(X)</math> as a representation of <math>X</math> which is well suited to capture the<br />
main variations in the data, i.e., on the manifold. When additional criteria (such<br />
as sparsity) are introduced in the learning model, one can no longer directly view<br />
<math>Y = f(X)</math> as an explicit low-dimensional coordinate system for points on the<br />
manifold, but it retains the property of capturing the main factors of variation<br />
in the data.<br />
<br />
== Stochastic Operator Perspective ==<br />
<br />
The denoising autoencoder can also be seen as corresponding to a semi-parametric model that can be sampled from. Define the joint distribution as follows: <br />
<br />
:<math>p(X, \tilde{X}) = p(\tilde{X}) p(X|\tilde{X}) = q^0(\tilde{X}) p(X|\tilde{X}) </math> <br />
<br />
from the stochastic operator <math>p(X | \tilde{X})</math>, with <math>q^0\,</math> being the empirical distribution.<br />
<br />
Using the Kullback-Leibler divergence, defined as:<br />
<br />
:<math>\mathbb{D}_{KL}(p\|q) = \mathbb{E}_{p(X)} \left(\log\frac{p(X)}{q(X)}\right) </math><br />
<br />
then minimizing <math>\mathbb{D}_{KL}(q^0(X, \tilde{X}) \| p(X, \tilde{X})) </math> yields the originally-formulated denoising criterion. Furthermore, as this objective is minimized, the marginals of <math>\,p</math> approach those of <math>\,q^0</math>, i.e. <math> p(X) \rightarrow q^0(X)</math>. Then, if <math>\,p</math> is expanded in the following way:<br />
<br />
:<math> p(X) = \frac{1}{n}\sum_{i=1}^n \sum_{\tilde{\mathbf{x}}} p(X|\tilde{X} = \tilde{\mathbf{x}}) q_{\mathcal{D}}(\tilde{\mathbf{x}} | \mathbf{x}_i) </math><br />
<br />
it becomes clear that the denoising autoencoder learns a semi-parametric model that can be sampled from (since <math>p(X)</math> above is easy to sample from). <br />
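<br />
The step from the KL divergence to the denoising criterion can be spelled out. The short derivation below uses only the factorization <math>p(X, \tilde{X}) = q^0(\tilde{X}) p(X|\tilde{X})</math> given above, together with the corresponding factorization of <math>q^0(X, \tilde{X})</math>; it is a sketch of the argument rather than a quotation from the paper.<br />

```latex
\begin{align}
\mathbb{D}_{KL}\left(q^0(X,\tilde{X}) \,\|\, p(X,\tilde{X})\right)
&= \mathbb{E}_{q^0(X,\tilde{X})}\left[\log\frac{q^0(\tilde{X})\,q^0(X|\tilde{X})}{q^0(\tilde{X})\,p(X|\tilde{X})}\right] \\
&= \mathbb{E}_{q^0(X,\tilde{X})}\left[\log q^0(X|\tilde{X})\right]
 - \mathbb{E}_{q^0(X,\tilde{X})}\left[\log p(X|\tilde{X})\right]
\end{align}
```

The first term does not depend on the model parameters, so minimizing the divergence is the same as maximizing the expected conditional log-likelihood <math>\mathbb{E}_{q^0(X,\tilde{X})}[\log p(X|\tilde{X})]</math>. When <math>p(X|\tilde{X})</math> is a product of Bernoulli distributions, the negative of this quantity is the expected reconstruction cross-entropy.<br />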
<br />
== Information Theoretic Perspective ==<br />
<br />
It is also possible to adopt an information theoretic perspective. The representation of the autoencoder should retain as much information as possible while at the same time certain properties, like a limited complexity, are imposed on the marginal distribution. This can be expressed as an optimization of <math>\arg\max_{\theta} \{I(X;Y) + \lambda \mathcal{J}(Y)\}</math> where <math>I(X; Y)</math> is the mutual information between an input sample <math>X</math> and the hidden representation <math>Y</math> and <math>\mathcal{J}</math> is a functional expressing the preference over the marginal. The hyper-parameter <math>\lambda</math> controls the trade-off between maximizing the mutual information and keeping the marginal simple.<br />
<br />
Note that this reasoning also applies to the basic autoencoder; the difference is that the denoising autoencoder maximizes the mutual information between <math>X</math> and <math>Y</math> even though <math>Y</math> is computed from a corrupted input.<br />
<br />
== Generative Model Perspective ==<br />
<br />
This perspective recovers the training criterion of the denoising autoencoder from a generative model. The criterion from the information theoretic perspective is equivalent to maximizing a variational bound on a particular generative model. The final training criterion is to maximize <math> \mathbb{E}_{q^0(\tilde{X})}[L(q^0, \tilde{X})] </math>, where <math> L(q^0, \tilde{X}) = \mathbb{E}_{q^0(X,Y | \tilde{X})}\left[\log\frac{p(X, \tilde{X}, Y)}{q^0(X, Y | \tilde{X})}\right] </math><br />
<br />
= Experiments =<br />
The benchmark consists of different<br />
variations of the MNIST digit classification problem, with added factors of<br />
variation such as rotation (rot), addition of a background composed of random<br />
pixels (bg-rand) or made from patches extracted from a set of images (bg-img), or<br />
combinations of these factors (rot-bg-img). These variations render the problems particularly challenging for current generic learning algorithms. Each problem<br />
is divided into a training, validation, and test set (10000, 2000, 50000 examples<br />
respectively). A subset of the original MNIST problem is also included with the<br />
same example set sizes (problem basic). The benchmark also contains additional<br />
binary classification problems: discriminating between convex and non-convex<br />
shapes (convex), and between wide and long rectangles (rect, rect-img).<br />
Neural networks with 3 hidden layers initialized by stacking denoising autoencoders<br />
(SdA-3), and fine tuned on the classification tasks, were evaluated<br />
on all the problems in this benchmark. Model selection was conducted following<br />
a similar procedure as Larochelle et al. (2007). Several values of the<br />
hyperparameters (destruction fraction ν, layer sizes, number of unsupervised training<br />
epochs) were tried, combined with early stopping in the fine tuning phase. For<br />
each task, the best model was selected based on its classification performance<br />
on the validation set.<br />
The results are reported in the following table.<br />
[[File:W5.png]]<br />
<br />
The filters obtained by training are shown in the figure below.<br />
<br />
<br />
[[File:Qq3.png]]<br />
<br />
= Conclusion and Future Work =<br />
<br />
The paper introduces the denoising autoencoder, motivated by the goal of<br />
learning representations of the input that are robust to small irrelevant changes<br />
in the input. The approach can also be motivated from a manifold learning<br />
perspective, an information theoretic perspective, and a generative model perspective.<br />
This principle can be used to train and stack autoencoders to initialize a<br />
deep neural network. A series of image classification experiments were performed<br />
to evaluate this new training principle. The empirical results support<br />
the following conclusions: unsupervised initialization of layers with an explicit<br />
denoising criterion helps to capture interesting structure in the input distribution.<br />
This in turn leads to intermediate representations much better suited for<br />
subsequent learning tasks such as supervised classification. The experimental<br />
results with Deep Belief Networks (whose layers are initialized as RBMs) suggest<br />
that RBMs may also encapsulate a form of robustness in the representations<br />
they learn, possibly because of their stochastic nature, which introduces noise<br />
in the representation during training.<br />
<br />
= References =<br />
<br />
Bengio, Y. (2007). Learning deep architectures for AI (Technical Report 1312).<br />
Université de Montréal, dept. IRO.<br />
<br />
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layerwise<br />
training of deep networks. Advances in Neural Information Processing<br />
Systems 19 (pp. 153–160). MIT Press.<br />
<br />
Bengio, Y., & Le Cun, Y. (2007). Scaling learning algorithms towards AI. In<br />
L. Bottou, O. Chapelle, D. DeCoste and J. Weston (Eds.), Large scale kernel<br />
machines. MIT Press.<br />
<br />
Doi, E., Balcan, D. C., & Lewicki, M. S. (2006). A theoretical analysis of<br />
robust coding over noisy overcomplete channels. In Y. Weiss, B. Schölkopf<br />
and J. Platt (Eds.), Advances in neural information processing systems 18,<br />
307–314. Cambridge, MA: MIT Press.<br />
<br />
Doi, E., & Lewicki, M. S. (2007). A theory of retinal population coding. NIPS<br />
(pp. 353–360). MIT Press.<br />
<br />
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant<br />
representations over learned dictionaries. IEEE Transactions on Image Processing,<br />
15, 3736–3745.<br />
<br />
Gallinari, P., LeCun, Y., Thiria, S., & Fogelman-Soulié, F. (1987). Mémoires<br />
associatives distribuées. Proceedings of COGNITIVA 87. Paris, La Villette.<br />
<br />
Hammond, D., & Simoncelli, E. (2007). A machine learning framework for adaptive<br />
combination of signal denoising methods. 2007 International Conference<br />
on Image Processing (pp. VI: 29–32).<br />
<br />
Hinton, G. (1989). Connectionist learning procedures. Artificial Intelligence,<br />
40, 185–234.<br />
<br />
Hinton, G., & Salakhutdinov, R. (2006). Reducing the dimensionality of data<br />
with neural networks. Science, 313, 504–507.<br />
<br />
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for<br />
deep belief nets. Neural Computation, 18, 1527–1554.<br />
<br />
Hopfield, J. (1982). Neural networks and physical systems with emergent collective<br />
computational abilities. Proceedings of the National Academy of Sciences,<br />
USA, 79.<br />
<br />
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007).<br />
An empirical evaluation of deep architectures on problems with many factors<br />
of variation. Twenty-fourth International Conference on Machine Learning<br />
(ICML’2007).<br />
<br />
LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Doctoral dissertation,<br />
Université de Paris VI.<br />
<br />
Lee, H., Ekanadham, C., & Ng, A. (2008). Sparse deep belief net model for visual<br />
area V2. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in<br />
neural information processing systems 20. Cambridge, MA: MIT Press.<br />
<br />
McClelland, J., Rumelhart, D., & the PDP Research Group (1986). Parallel<br />
distributed processing: Explorations in the microstructure of cognition, vol. 2.<br />
Cambridge: MIT Press.<br />
<br />
Memisevic, R. (2007). Non-linear latent factor models for revealing structure<br />
in high-dimensional data. Doctoral dissertation, Department of Computer<br />
Science, University of Toronto, Toronto, Ontario, Canada.<br />
<br />
Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2008). Sparse feature learning for<br />
deep belief networks. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.),<br />
Advances in neural information processing systems 20. Cambridge, MA: MIT<br />
Press.<br />
<br />
Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning<br />
of sparse representations with an energy-based model. Advances in Neural<br />
Information Processing Systems (NIPS 2006). MIT Press.<br />
<br />
Roth, S., & Black, M. (2005). Fields of experts: a framework for learning image<br />
priors. IEEE Conference on Computer Vision and Pattern Recognition (pp.<br />
860–867).</div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=extracting_and_Composing_Robust_Features_with_Denoising_Autoencoders&diff=27330extracting and Composing Robust Features with Denoising Autoencoders2015-12-16T17:16:35Z<p>Arashwan: /* The Denoising Autoencoder */</p>
<hr />
<div>= Introduction =<br />
This paper explores a new training principle for unsupervised learning<br />
of a representation based on the idea of making the learned representations<br />
robust to partial corruption of the input pattern. This approach can<br />
be used to train autoencoders, and these denoising autoencoders can be<br />
stacked to initialize deep architectures. The algorithm can be motivated<br />
from a manifold learning and information theoretic perspective or from a<br />
generative model perspective.<br />
The proposed model is similar to a standard autoencoder, which is trained to learn a hidden representation from which its input can be reconstructed. The difference is that the denoising autoencoder is trained to reconstruct the original input from a corrupted version of it, obtained by applying random noise to the data. This forces the model to extract more robust, useful features.<br />
== Motivation ==<br />
<br />
The approach is based on the use of an unsupervised<br />
training criterion to perform a layer-by-layer initialization. The procedure is as follows :<br />
Each layer is at first trained to produce a higher level (hidden) representation of the observed patterns,<br />
based on the representation it receives as input from the layer below, by<br />
optimizing a local unsupervised criterion. Each level produces a representation<br />
of the input pattern that is more abstract than the previous level’s, because it<br />
is obtained by composing more operations. This initialization yields a starting<br />
point, from which a global fine-tuning of the model’s parameters is then performed<br />
using another training criterion appropriate for the task at hand.<br />
<br />
This process gives better solutions than those obtained with random initialization.<br />
<br />
= The Denoising Autoencoder =<br />
<br />
A Denoising Autoencoder reconstructs<br />
a clean “repaired” input from a corrupted, partially destroyed one. This<br />
is done by first corrupting the initial input <math>x</math> to get a partially destroyed version<br />
<math>\tilde{x}</math> by means of a stochastic mapping. This means<br />
that the autoencoder must learn to compute a representation<br />
that is informative of the original input even<br />
when some of its elements are missing. This technique<br />
was inspired by the ability of humans to have an appropriate<br />
understanding of their environment even in<br />
situations where the available information is incomplete<br />
(e.g. when looking at an object that is partly<br />
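occluded). In this paper, the input is corrupted by choosing a fixed number <math>\nu d</math> of the <math>d</math> input components at random and setting them to zero, leaving the rest untouched (<math>\nu</math> is the destruction fraction). For images, this resembles the "pepper" part of salt-and-pepper noise.<br />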
<br />
Thus the objective function can be described as<br />
[[File:W1.png]]<br />
<br />
The objective function minimized by<br />
stochastic gradient descent becomes: <br />
[[File:W2.png]]<br />
<br />
where the loss function is the reconstruction cross-entropy.<br />
The denoising autoencoder is illustrated in the figure below:<br />
<br />
[[File:W3.png]]<br />
<br />
It is important to note that usually the dimensionality of the hidden layer needs to be less than the input/output layer in order to avoid the trivial solution of identity mapping, but in this case that is not a problem since randomly zeroing out numbers causes the identity map to fail. This forces the network to learn a more abstract representation of the data regardless of the relative sizes of the layers.<br />
<br />
= Layer-wise Initialization and Fine Tuning =<br />
<br />
When denoising autoencoders are stacked, the output of the k-th layer is used as<br />
input for the (k + 1)-th, and the (k + 1)-th layer is trained after the k-th has been<br />
trained. After a few layers have been trained, the parameters are used as initialization<br />
for a network optimized with respect to a supervised training criterion.<br />
This greedy layer-wise procedure has been shown to yield significantly better<br />
local minima than random initialization of deep networks,<br />
achieving better generalization on a number of tasks.<br />
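<br />
The greedy procedure can be sketched as below. This is a simplified illustration under several assumptions: sigmoid units, untied weights, plain per-example SGD, and hypothetical helper names <code>train_dae</code> and <code>stack_daes</code>. The supervised fine-tuning stage is not shown.<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dae(data, n_hidden, nu, rng, epochs=5, lr=0.1):
    """Train one denoising autoencoder with SGD; return the encoder parameters (W, b)."""
    n, d = data.shape
    W = rng.normal(scale=0.1, size=(d, n_hidden))
    b = np.zeros(n_hidden)
    W_out = rng.normal(scale=0.1, size=(n_hidden, d))
    b_out = np.zeros(d)
    n_destroy = int(round(nu * d))
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = data[i]
            x_tilde = x.copy()
            x_tilde[rng.choice(d, size=n_destroy, replace=False)] = 0.0
            y = sigmoid(x_tilde @ W + b)
            z = sigmoid(y @ W_out + b_out)
            # Gradients of the cross-entropy loss measured against the clean x
            dz = z - x
            dy = (dz @ W_out.T) * y * (1.0 - y)
            W_out -= lr * np.outer(y, dz)
            b_out -= lr * dz
            W -= lr * np.outer(x_tilde, dy)
            b -= lr * dy
    return W, b

def stack_daes(data, layer_sizes, nu, rng):
    """Greedy layer-wise pretraining: layer k+1 is trained on the output of layer k."""
    params, rep = [], data
    for n_hidden in layer_sizes:
        W, b = train_dae(rep, n_hidden, nu, rng)
        params.append((W, b))
        rep = sigmoid(rep @ W + b)   # uncorrupted input is propagated to feed the next layer
    return params, rep

rng = np.random.default_rng(0)
data = (rng.random((20, 12)) > 0.5).astype(float)
params, top = stack_daes(data, layer_sizes=[8, 6, 4], nu=0.25, rng=rng)
```

The returned <code>params</code> would then serve as the initialization of a network that is fine-tuned with a supervised criterion.<br />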
<br />
= Analysis of the Denoising Autoencoder =<br />
== Manifold Learning Perspective ==<br />
<br />
<br />
The process of mapping a corrupted example to an uncorrupted one can be<br />
visualized in Figure 2, with a low-dimensional manifold <math>\mathcal{M}</math> near which the data<br />
concentrate. We learn a stochastic operator <math>p(X|\tilde{X})</math> that maps an <math>\tilde{X}</math> to an <math>X\,</math>.<br />
<br />
<br />
[[File:q4.png]]<br />
<br />
Since the corrupted points <math>\tilde{X}</math> will likely lie farther from <math>\mathcal{M}</math> than the clean points, the learned map <math>p(X|\tilde{X})</math> learns to move points that are away from <math>\mathcal{M}</math> back onto <math>\mathcal{M}</math>.<br />
<br />
The denoising autoencoder can thus be seen as a way to define and learn a<br />
manifold. The intermediate representation <math>Y = f(X)</math> can be interpreted as a<br />
coordinate system for points on the manifold (this is most clear if we force the<br />
dimension of <math>Y</math> to be smaller than the dimension of <math>X</math>). More generally, one can<br />
think of <math>Y = f(X)</math> as a representation of <math>X</math> which is well suited to capture the<br />
main variations in the data, i.e., on the manifold. When additional criteria (such<br />
as sparsity) are introduced in the learning model, one can no longer directly view<br />
<math>Y = f(X)</math> as an explicit low-dimensional coordinate system for points on the<br />
manifold, but it retains the property of capturing the main factors of variation<br />
in the data.<br />
<br />
== Stochastic Operator Perspective ==<br />
<br />
The denoising autoencoder can also be seen as corresponding to a semi-parametric model that can be sampled from. Define the joint distribution as follows: <br />
<br />
:<math>p(X, \tilde{X}) = p(\tilde{X}) p(X|\tilde{X}) = q^0(\tilde{X}) p(X|\tilde{X}) </math> <br />
<br />
from the stochastic operator <math>p(X | \tilde{X})</math>, with <math>q^0\,</math> being the empirical distribution.<br />
<br />
Using the Kullback-Leibler divergence, defined as:<br />
<br />
:<math>\mathbb{D}_{KL}(p\|q) = \mathbb{E}_{p(X)} \left(\log\frac{p(X)}{q(X)}\right) </math><br />
<br />
then minimizing <math>\mathbb{D}_{KL}(q^0(X, \tilde{X}) \| p(X, \tilde{X})) </math> yields the originally-formulated denoising criterion. Furthermore, as this objective is minimized, the marginals of <math>\,p</math> approach those of <math>\,q^0</math>, i.e. <math> p(X) \rightarrow q^0(X)</math>. Then, if <math>\,p</math> is expanded in the following way:<br />
<br />
:<math> p(X) = \frac{1}{n}\sum_{i=1}^n \sum_{\tilde{\mathbf{x}}} p(X|\tilde{X} = \tilde{\mathbf{x}}) q_{\mathcal{D}}(\tilde{\mathbf{x}} | \mathbf{x}_i) </math><br />
<br />
it becomes clear that the denoising autoencoder learns a semi-parametric model that can be sampled from (since <math>p(X)</math> above is easy to sample from). <br />
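<br />
The expansion of <math>p(X)</math> suggests a three-step sampling procedure: pick a training example uniformly, corrupt it, then sample from <math>p(X|\tilde{X})</math>. The sketch below assumes binary inputs and uses hypothetical stand-ins (<code>toy_corrupt</code>, <code>toy_reconstruct_prob</code>) in place of the corruption process and a trained autoencoder.<br />

```python
import numpy as np

def sample_from_model(train_x, corrupt, reconstruct_prob, rng):
    """Draw one sample: pick x_i uniformly, corrupt it, then sample X from p(X | x_tilde)."""
    x_i = train_x[rng.integers(len(train_x))]
    x_tilde = corrupt(x_i, rng)
    probs = reconstruct_prob(x_tilde)       # Bernoulli means of p(X | x_tilde)
    return (rng.random(probs.shape) < probs).astype(float)

def toy_corrupt(x, rng):
    # Hypothetical corruption: zero a single randomly chosen component
    x_tilde = x.copy()
    x_tilde[rng.integers(len(x))] = 0.0
    return x_tilde

def toy_reconstruct_prob(x_tilde):
    # Hypothetical stand-in for a trained denoising autoencoder
    return np.clip(x_tilde, 0.05, 0.95)

rng = np.random.default_rng(0)
train_x = (rng.random((5, 6)) > 0.5).astype(float)
sample = sample_from_model(train_x, toy_corrupt, toy_reconstruct_prob, rng)
```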
<br />
== Information Theoretic Perspective ==<br />
<br />
It is also possible to adopt an information theoretic perspective. The representation of the autoencoder should retain as much information as possible while at the same time certain properties, like a limited complexity, are imposed on the marginal distribution. This can be expressed as an optimization of <math>\arg\max_{\theta} \{I(X;Y) + \lambda \mathcal{J}(Y)\}</math> where <math>I(X; Y)</math> is the mutual information between an input sample <math>X</math> and the hidden representation <math>Y</math> and <math>\mathcal{J}</math> is a functional expressing the preference over the marginal. The hyper-parameter <math>\lambda</math> controls the trade-off between maximizing the mutual information and keeping the marginal simple.<br />
<br />
Note that this reasoning also applies to the basic autoencoder. The difference is that the denoising autoencoder maximizes the mutual information between <math>X</math> and <math>Y</math> when <math>Y</math> is computed from the corrupted input.<br />
<br />
== Generative Model Perspective ==<br />
<br />
The training criterion for the denoising autoencoder can also be recovered from a generative model: the criterion from the information theoretic perspective is equivalent to maximizing a variational bound on a particular generative model. The final training criterion is to maximize <math> \mathbb{E}_{q^0(\tilde{X})}[L(q^0, \tilde{X})] </math>, where <math> L(q^0, \tilde{X}) = \mathbb{E}_{q^0(X,Y | \tilde{X})}\left[\log\frac{p(X, \tilde{X}, Y)}{q^0(X, Y | \tilde{X})}\right] </math>.<br />
<br />
= Experiments =<br />
The benchmark contains different<br />
variations of the MNIST digit classification problem, with added factors of<br />
variation such as rotation (rot), addition of a background composed of random<br />
pixels (bg-rand) or made from patches extracted from a set of images (bg-img), or<br />
combinations of these factors (rot-bg-img). These variations render the problems particularly challenging for current generic learning algorithms. Each problem<br />
is divided into a training, validation, and test set (10000, 2000, 50000 examples<br />
respectively). A subset of the original MNIST problem is also included with the<br />
same example set sizes (problem basic). The benchmark also contains additional<br />
binary classification problems: discriminating between convex and non-convex<br />
shapes (convex), and between wide and long rectangles (rect, rect-img).<br />
Neural networks with 3 hidden layers initialized by stacking denoising autoencoders<br />
(SdA-3), and fine tuned on the classification tasks, were evaluated<br />
on all the problems in this benchmark. Model selection was conducted following<br />
a procedure similar to that of Larochelle et al. (2007). Several values of the<br />
hyperparameters (destruction fraction ν, layer sizes, number of unsupervised training<br />
epochs) were tried, combined with early stopping in the fine tuning phase. For<br />
each task, the best model was selected based on its classification performance<br />
on the validation set.<br />
The results are reported in the following table.<br />
[[File:W5.png]]<br />
<br />
The filters obtained by training are shown in the figure below.<br />
<br />
<br />
[[File:Qq3.png]]<br />
<br />
= Conclusion and Future Work =<br />
<br />
The paper presents the denoising autoencoder, which was motivated by the goal of<br />
learning representations of the input that are robust to small irrelevant changes<br />
in the input. It can also be motivated from a manifold learning<br />
perspective and from the perspective of a generative model.<br />
This principle can be used to train and stack autoencoders to initialize a<br />
deep neural network. A series of image classification experiments were performed<br />
to evaluate this new training principle. The empirical results support<br />
the following conclusions: unsupervised initialization of layers with an explicit<br />
denoising criterion helps to capture interesting structure in the input distribution.<br />
This in turn leads to intermediate representations much better suited for<br />
subsequent learning tasks such as supervised classification. The experimental<br />
results with Deep Belief Networks (whose layers are initialized as RBMs) suggest<br />
that RBMs may also encapsulate a form of robustness in the representations<br />
they learn, possibly because of their stochastic nature, which introduces noise<br />
in the representation during training.<br />
<br />
= References =<br />
<br />
Bengio, Y. (2007). Learning deep architectures for AI (Technical Report 1312).<br />
Université de Montréal, dept. IRO.<br />
<br />
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layerwise<br />
training of deep networks. Advances in Neural Information Processing<br />
Systems 19 (pp. 153–160). MIT Press.<br />
<br />
Bengio, Y., & Le Cun, Y. (2007). Scaling learning algorithms towards AI. In<br />
L. Bottou, O. Chapelle, D. DeCoste and J. Weston (Eds.), Large scale kernel<br />
machines. MIT Press.<br />
<br />
Doi, E., Balcan, D. C., & Lewicki, M. S. (2006). A theoretical analysis of<br />
robust coding over noisy overcomplete channels. In Y. Weiss, B. Schölkopf<br />
and J. Platt (Eds.), Advances in neural information processing systems 18,<br />
307–314. Cambridge, MA: MIT Press.<br />
<br />
Doi, E., & Lewicki, M. S. (2007). A theory of retinal population coding. NIPS<br />
(pp. 353–360). MIT Press.<br />
<br />
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant<br />
representations over learned dictionaries. IEEE Transactions on Image Processing,<br />
15, 3736–3745.<br />
<br />
Gallinari, P., LeCun, Y., Thiria, S., & Fogelman-Soulié, F. (1987). Mémoires<br />
associatives distribuées. Proceedings of COGNITIVA 87. Paris, La Villette.<br />
<br />
Hammond, D., & Simoncelli, E. (2007). A machine learning framework for adaptive<br />
combination of signal denoising methods. 2007 International Conference<br />
on Image Processing (pp. VI: 29–32).<br />
<br />
Hinton, G. (1989). Connectionist learning procedures. Artificial Intelligence,<br />
40, 185–234.<br />
<br />
Hinton, G., & Salakhutdinov, R. (2006). Reducing the dimensionality of data<br />
with neural networks. Science, 313, 504–507.<br />
<br />
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for<br />
deep belief nets. Neural Computation, 18, 1527–1554.<br />
<br />
Hopfield, J. (1982). Neural networks and physical systems with emergent collective<br />
computational abilities. Proceedings of the National Academy of Sciences,<br />
USA, 79.<br />
<br />
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007).<br />
An empirical evaluation of deep architectures on problems with many factors<br />
of variation. Twenty-fourth International Conference on Machine Learning<br />
(ICML’2007).<br />
<br />
LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Doctoral dissertation,<br />
Université de Paris VI.<br />
<br />
Lee, H., Ekanadham, C., & Ng, A. (2008). Sparse deep belief net model for visual<br />
area V2. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in<br />
neural information processing systems 20. Cambridge, MA: MIT Press.<br />
<br />
McClelland, J., Rumelhart, D., & the PDP Research Group (1986). Parallel<br />
distributed processing: Explorations in the microstructure of cognition, vol. 2.<br />
Cambridge: MIT Press.<br />
<br />
Memisevic, R. (2007). Non-linear latent factor models for revealing structure<br />
in high-dimensional data. Doctoral dissertation, Department of Computer<br />
Science, University of Toronto, Toronto, Ontario, Canada.<br />
<br />
Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2008). Sparse feature learning for<br />
deep belief networks. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.),<br />
Advances in neural information processing systems 20. Cambridge, MA: MIT<br />
Press.<br />
<br />
Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning<br />
of sparse representations with an energy-based model. Advances in Neural<br />
Information Processing Systems (NIPS 2006). MIT Press.<br />
<br />
Roth, S., & Black, M. (2005). Fields of experts: a framework for learning image<br />
priors. IEEE Conference on Computer Vision and Pattern Recognition (pp.<br />
860–867).</div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26653dropout2015-11-19T20:58:31Z<p>Arashwan: /* Model */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks that contain a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, the effect of averaging the predictions of all these thinned networks is approximated by a single unthinned network. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p independently of other units (p can be set using a validation set, or can be set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z}^{(l)} </math> denote the vector of inputs into layer <math> l </math>, and <math>\bold{y}^{(l)} </math> the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math> , where <math> f </math> is the activation function.<br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables, each of which has probability <math>p </math> of being 1. <math>\tilde {\bold y}^{(l)} </math> is the thinned output of layer <math>l </math>, which is used as the input to layer <math>l+1 </math> after some units have been dropped. The rest of the model remains the same as a regular feed-forward neural network.<br />
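<br />
The four equations above translate almost line by line into code. The sketch below is illustrative only; the function name <code>dropout_layer</code> and the choice of <code>tanh</code> as the activation <math>f</math> are assumptions.<br />

```python
import numpy as np

def dropout_layer(y, W, b, p, rng, f=np.tanh):
    """One feed-forward step with dropout applied to the incoming activations y."""
    r = (rng.random(y.shape) < p).astype(y.dtype)   # r_j ~ Bernoulli(p), 1 means "keep"
    y_tilde = r * y                                  # thinned outputs of layer l
    z = W @ y_tilde + b                              # pre-activations of layer l+1
    return f(z), r

rng = np.random.default_rng(0)
y = rng.normal(size=6)
W = rng.normal(size=(4, 6))
b = np.zeros(4)
y_next, r = dropout_layer(y, W, b, p=0.5, rng=rng)
```

A fresh mask <code>r</code> is drawn on every call, so each training case is propagated through a different thinned network.<br />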
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that, for each training case, forward and backward propagation are done only on the sampled thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
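<br />
The zero-gradient statement can be checked on a single unit. In the sketch below (variable names are illustrative), the pre-activation is <math>z = \bold w \cdot (\bold r * \bold y)</math>, so the gradient with respect to <math>w_j</math> is <math>r_j y_j</math>, which vanishes wherever the unit was dropped.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=5)                     # outputs of layer l for one training case
w = rng.normal(size=5)                     # incoming weights of one unit in layer l+1
r = np.array([1.0, 0.0, 1.0, 0.0, 1.0])    # a fixed dropout mask for this case

def z(w_vec):
    return w_vec @ (r * y)

grad_w = r * y                             # analytic gradient dz/dw

# Central finite differences as an independent check
eps = 1e-6
numeric = np.array([(z(w + eps * e) - z(w - eps * e)) / (2 * eps) for e in np.eye(5)])
```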
<br />
Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. The network is first pretrained with the regular methods (RBMs, autoencoders, etc.). After pretraining, the weights are scaled up by a factor of <math> 1/p </math>, and then dropout finetuning is applied. A smaller learning rate should be used so that the information in the pretrained weights is retained.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over just using dropout. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then the constraint is <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a very large learning rate without the weights blowing up. <br />
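<br />
One common way to enforce such a constraint is to rescale a weight vector whenever its norm exceeds <math>c </math>, typically after each gradient update. The sketch below assumes that each row of <code>W</code> holds the incoming weights of one hidden unit; the function name <code>max_norm_project</code> is hypothetical.<br />

```python
import numpy as np

def max_norm_project(W, c):
    """Rescale each row of W (incoming weights of one hidden unit) to have L2 norm at most c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))   # rows already inside the ball keep scale 1
    return W * scale

W = np.array([[3.0, 4.0],     # norm 5, will be shrunk
              [0.3, 0.4]])    # norm 0.5, left unchanged
W_proj = max_norm_project(W, c=1.0)
```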
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has <math>n </math> units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
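<br />
For a small layer, the expectation over all <math>2^n </math> masks can be computed exactly and compared with the scaled-weight network. The sketch below uses made-up values for <code>W</code> and <code>y</code>. The match is exact only for the pre-activation, which is linear in the mask; after a nonlinearity the scaling is an approximation.<br />

```python
import itertools
import numpy as np

p = 0.5
y = np.array([1.0, -2.0, 0.5])
W = np.array([[0.2, -0.1, 0.4],
              [0.7, 0.3, -0.5]])

# Exact expectation of W @ (r * y) over all 2^n masks
expected = np.zeros(2)
for mask in itertools.product([0.0, 1.0], repeat=len(y)):
    r = np.array(mask)
    prob = np.prod(np.where(r == 1.0, p, 1.0 - p))   # probability of this particular mask
    expected += prob * (W @ (r * y))

test_time = (p * W) @ y    # single network with weights scaled by p
```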
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout multiplies the activations by Bernoulli distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>, which works just as well as, or perhaps better than, Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which is equal to <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. This can be further generalized to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2</math> is a hyperparameter to tune.<br />
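<br />
A minimal sketch of the Gaussian variant follows; the function name <code>gaussian_dropout</code> is an assumption. Because the noise has mean 1, the expected activation is unchanged.<br />

```python
import numpy as np

def gaussian_dropout(h, sigma, rng):
    """Multiply each activation by r' ~ N(1, sigma^2)."""
    r_prime = rng.normal(loc=1.0, scale=sigma, size=h.shape)
    return h * r_prime

rng = np.random.default_rng(0)
h = np.array([0.5, 1.0, 2.0])
noisy = gaussian_dropout(h, sigma=1.0, rng=rng)
unchanged = gaussian_dropout(h, sigma=0.0, rng=rng)   # zero variance leaves h as it is
```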
<br />
== Applying dropout to linear regression ==<br />
<br />
Let <math>X \in \mathbb{R}^{N\times D}</math> be a data matrix of <math>N</math> data points, and let <math>\mathbf{y}\in \mathbb{R}^N</math> be a vector of targets. Linear regression tries to find a <math>\mathbf{w}\in \mathbb{R}^D</math> that minimizes <math>\parallel \mathbf{y}-X\mathbf{w}\parallel^2</math>.<br />
<br />
When the input <math>X</math> is dropped out such that any input dimension is retained with probability <math>p</math>, the input can be expressed as <math>R*X</math> where <math>R\in \{0,1\}^{N\times D}</math> is a random matrix with <math>R_{ij}\sim Bernoulli(p)</math> and <math>*</math> denotes element-wise product. Marginalizing the noise, the objective function becomes<br />
<br />
<math>\min_{\mathbf{w}} \mathbb{E}_{R\sim Bernoulli(p)}[\parallel \mathbf{y}-(R*X)\mathbf{w}\parallel^2 ]<br />
</math><br />
<br />
This reduces to <br />
<br />
<math>\min_{\mathbf{w}} \parallel \mathbf{y}-pX\mathbf{w}\parallel^2+p(1-p)\parallel \Gamma\mathbf{w}\parallel^2<br />
</math><br />
<br />
where <math>\Gamma=(diag(X^TX))^{\frac{1}{2}}</math>. Therefore, dropout with linear regression is equivalent to ridge regression with a particular form for <math>\Gamma</math>. This form of <math>\Gamma</math> essentially scales the weight cost for weight <math>w_i</math> by the standard deviation of the <math>i</math>th dimension of the data. If a particular data dimension varies a lot, the regularizer tries to squeeze its weight more.<br />
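<br />
The reduction can be verified numerically on a tiny problem by enumerating every Bernoulli mask. The values of <code>X</code>, <code>y</code>, <code>w</code> and <code>p</code> below are arbitrary and chosen only for illustration.<br />

```python
import itertools
import numpy as np

p = 0.7
X = np.array([[1.0, 2.0],
              [0.5, -1.0]])
y = np.array([1.0, 0.0])
w = np.array([0.3, -0.2])

# Left-hand side: exact expectation over all 2^(N*D) Bernoulli masks R
lhs = 0.0
for bits in itertools.product([0.0, 1.0], repeat=X.size):
    R = np.array(bits).reshape(X.shape)
    prob = np.prod(np.where(R == 1.0, p, 1.0 - p))
    lhs += prob * np.sum((y - (R * X) @ w) ** 2)

# Right-hand side: closed form with Gamma = diag(X^T X)^(1/2)
gamma = np.sqrt(np.diag(X.T @ X))
rhs = np.sum((y - (p * X) @ w) ** 2) + p * (1 - p) * np.sum((gamma * w) ** 2)
```

The two quantities agree because each masked term <math>R_{ij}X_{ij}w_j</math> has mean <math>pX_{ij}w_j</math> and variance <math>p(1-p)X_{ij}^2w_j^2</math>, and the masks are independent.<br />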
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
In a standard neural network, units may change in a way that fixes up the mistakes of other units, which may lead to complex co-adaptations. These in turn lead to overfitting, because such co-adaptations do not generalize to unseen data. Dropout breaks the co-adaptations between hidden units by making the presence of other units unreliable. Figure 7a shows that, without dropout, the hidden units have co-adapted in order to produce good reconstructions: each hidden unit on its own does not detect a meaningful feature. In Figure 7b, with dropout, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Dropout also makes the activations of the hidden units sparse, which helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, fewer hidden units have high activations with dropout (Figure 8b) than without it (Figure 8a).<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper examined the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant (fixed <math>n </math>).<br />
2. The expected number of hidden units that will be retained after dropout is held constant (fixed <math>pn </math>).<br />
The optimal <math>p </math> in case 1 lies between 0.4 and 0.8, while in case 2 it is about 0.6. The usual default value in practice is 0.5, which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. Figure 10 shows that dropout does not give any improvement on small data sets (100, 500). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors applied dropout to the MNIST data and compared it with different methods. The MNIST data set consists of 28 × 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. In the result table, Deep Boltzmann Machine + dropout finetuning performs best, with an error rate of only 0.79%. <br />
<br />
[[File:Result.png]]<br />
<br />
In order to test the robustness of dropout, they did classification experiments with networks of many different architectures keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these different architectures as training progresses. Dropout gives a huge improvement across all architectures.<br />
<br />
[[File:dropout.PNG]]<br />
<br />
The authors also applied dropout to many neural networks and tested on different datasets, such as Street View House Numbers (SVHN), CIFAR, ImageNet and TIMIT. In these experiments, adding dropout consistently reduced the error rate and improved the performance of the neural networks.<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique for preventing overfitting in deep neural networks that have a large number of parameters. It can also be extended to Restricted Boltzmann Machines and other graphical models, as well as to convolutional networks. One drawback of dropout is that it increases training time. This creates a trade-off between overfitting and training time.<br />
<br />
=Reference=<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26652dropout2015-11-19T20:44:24Z<p>Arashwan: /* Model */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks that contain a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, the effect of averaging the predictions of all these thinned networks is approximated by a single unthinned network. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p independently of other units (p can be set using a validation set, or can be set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z}^{(l)} </math> denote the vector of inputs into layer <math> l </math>, and <math>\bold{y}^{(l)} </math> the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math> , where <math> f </math> is the activation function.<br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables, each of which has probability <math>p </math> of being 1. <math>\tilde {\bold y}^{(l)} </math> is the thinned output of layer <math>l </math>, which is used as the input to layer <math>l+1 </math> after some units have been dropped. The rest of the model remains the same as a regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that, for each training case, forward and backward propagation are done only on the sampled thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. A smaller learning rate should be used so that the information in the pretrained weights is retained.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over just using dropout. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then the constraint is <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a very large learning rate without the weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has <math>n </math> units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout multiplies the activations by Bernoulli distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>, which works just as well as, or perhaps better than, Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which is equal to <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. This can be further generalized to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2</math> is a hyperparameter to tune.<br />
<br />
== Applying dropout to linear regression ==<br />
<br />
Let <math>X \in \mathbb{R}^{N\times D}</math> be a data matrix of <math>N</math> data points, and let <math>\mathbf{y}\in \mathbb{R}^N</math> be a vector of targets. Linear regression tries to find a <math>\mathbf{w}\in \mathbb{R}^D</math> that minimizes <math>\parallel \mathbf{y}-X\mathbf{w}\parallel^2</math>.<br />
<br />
When the input <math>X</math> is dropped out such that any input dimension is retained with probability <math>p</math>, the input can be expressed as <math>R*X</math> where <math>R\in \{0,1\}^{N\times D}</math> is a random matrix with <math>R_{ij}\sim Bernoulli(p)</math> and <math>*</math> denotes element-wise product. Marginalizing the noise, the objective function becomes<br />
<br />
<math>\min_{\mathbf{w}} \mathbb{E}_{R\sim Bernoulli(p)}[\parallel \mathbf{y}-(R*X)\mathbf{w}\parallel^2 ]<br />
</math><br />
<br />
This reduces to <br />
<br />
<math>\min_{\mathbf{w}} \parallel \mathbf{y}-pX\mathbf{w}\parallel^2+p(1-p)\parallel \Gamma\mathbf{w}\parallel^2<br />
</math><br />
<br />
where <math>\Gamma=(diag(X^TX))^{\frac{1}{2}}</math>. Therefore, dropout with linear regression is equivalent to ridge regression with a particular form for <math>\Gamma</math>. This form of <math>\Gamma</math> essentially scales the weight cost for weight <math>w_i</math> by the standard deviation of the <math>i</math>th dimension of the data. If a particular data dimension varies a lot, the regularizer tries to squeeze its weight more.<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
In a standard neural network, units may change in a way that fixes up the mistakes of other units, which may lead to complex co-adaptations. These in turn lead to overfitting, because such co-adaptations do not generalize to unseen data. Dropout breaks the co-adaptations between hidden units by making the presence of other units unreliable. Figure 7a shows that, without dropout, the hidden units have co-adapted in order to produce good reconstructions: each hidden unit on its own does not detect a meaningful feature. In Figure 7b, with dropout, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Dropout also makes the activations of the hidden units sparse, which helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, fewer hidden units have high activations with dropout (Figure 8b) than without it (Figure 8a).<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper examined the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant (fixed <math>n </math>).<br />
2. The expected number of hidden units that will be retained after dropout is held constant (fixed <math>pn </math>).<br />
The optimal <math>p </math> in case 1 lies between 0.4 and 0.8, while in case 2 it is about 0.6. The usual default value in practice is 0.5, which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. Figure 10 shows that dropout does not give any improvement on small data sets (100, 500). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors applied dropout to the MNIST data and compared it with different methods. The MNIST data set consists of 28 × 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. In the result table, Deep Boltzmann Machine + dropout finetuning performs best, with an error rate of only 0.79%. <br />
<br />
[[File:Result.png]]<br />
<br />
In order to test the robustness of dropout, they did classification experiments with networks of many different architectures keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these different architectures as training progresses. Dropout gives a huge improvement across all architectures.<br />
<br />
[[File:dropout.PNG]]<br />
<br />
The authors also applied dropout to many neural networks and tested them on different data sets, such as Street View House Numbers (SVHN), CIFAR, ImageNet and TIMIT. In these experiments, adding dropout reduced the error rate and improved the performance of the networks.<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique for preventing overfitting in deep neural networks that have a large number of parameters. It can also be extended to Restricted Boltzmann Machines and to other models, e.g. convolutional networks. One drawback of dropout is that it increases training time. This creates a trade-off between overfitting and training time.<br />
<br />
=Reference=<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26651dropout2015-11-19T20:18:33Z<p>Arashwan: /* Introduction */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks that contain a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p, independently of the other units (p can be chosen using a validation set, or can be set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z}^{(l)} </math> denote the vector of inputs into layer <math> l </math>, and let <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
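
The four equations above can be sketched for a single layer as follows. This is a minimal illustration in plain Python, not the authors' implementation; the function name <code>dropout_layer_forward</code>, the ReLU activation and the example weights are assumptions made for the sketch.

```python
import random


def dropout_layer_forward(y, W, b, p, f, rng):
    """One dropout feed-forward step; y is the output vector of layer l."""
    # r_j ~ Bernoulli(p): each unit is retained with probability p
    r = [1.0 if rng.random() < p else 0.0 for _ in y]
    # element-wise product gives the thinned output y_tilde
    y_tilde = [r_j * y_j for r_j, y_j in zip(r, y)]
    # z_i = w_i . y_tilde + b_i for every unit i of layer l+1
    z = [sum(w_ij * yt_j for w_ij, yt_j in zip(w_i, y_tilde)) + b_i
         for w_i, b_i in zip(W, b)]
    # y_i = f(z_i)
    return [f(z_i) for z_i in z], r


def relu(x):
    return max(0.0, x)


# Hypothetical toy values: 3 units in layer l, 2 units in layer l+1
rng = random.Random(0)
y_prev = [1.0, 2.0, 3.0]
W = [[0.5, -1.0, 0.25], [1.0, 1.0, 1.0]]
b = [0.1, -0.2]
y_next, mask = dropout_layer_forward(y_prev, W, b, 0.5, relu, rng)
```

Setting <code>p = 1.0</code> in this sketch retains every unit, which recovers the regular feed-forward pass.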
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that, for each training case, we backpropagate only through the thinned network sampled for that case. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. A smaller learning rate should be used so that the information in the pretrained weights is retained.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over just using dropout. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it allows a very large learning rate to be used without the possibility of the weights blowing up. <br />
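
One common way to enforce such a constraint is to rescale the weight vector whenever its norm exceeds <math>c </math>. The sketch below assumes this projection approach; the function name <code>max_norm_project</code> and the example values are hypothetical.

```python
import math


def max_norm_project(w, c):
    """Rescale w so that its L2 norm is at most c (unchanged if already inside)."""
    norm = math.sqrt(sum(w_i * w_i for w_i in w))
    if norm <= c:
        return list(w)
    scale = c / norm
    return [w_i * scale for w_i in w]


w_big = max_norm_project([3.0, 4.0], c=2.0)  # norm 5 is rescaled to norm 2
w_ok = max_norm_project([0.3, 0.4], c=2.0)   # norm 0.5 is left unchanged
```

In a training loop, such a projection would typically be applied to each hidden unit's incoming weight vector after every gradient update.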
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units; there are then <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
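
For a single linear unit the weight-scaling rule is exact, which can be checked by enumerating all <math>2^n </math> masks for a small n. The following sketch does this; the function names and toy values are assumptions, and for networks with non-linearities the rule is only an approximation of the true average.

```python
from itertools import product


def expected_linear_output(y, w, p):
    """Average w . (r * y) over all 2^n masks r, weighted by their probability."""
    total = 0.0
    for r in product([0, 1], repeat=len(y)):
        kept = sum(r)
        prob = (p ** kept) * ((1 - p) ** (len(y) - kept))
        total += prob * sum(w_j * r_j * y_j for w_j, r_j, y_j in zip(w, r, y))
    return total


def scaled_weight_output(y, w, p):
    """Single pass without dropout, with the outgoing weights multiplied by p."""
    return sum((p * w_j) * y_j for w_j, y_j in zip(w, y))


# Hypothetical toy values: 3 input units feeding one linear output unit
y = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
avg = expected_linear_output(y, w, p=0.5)
scaled = scaled_weight_output(y, w, p=0.5)
```

Here <code>avg</code> and <code>scaled</code> agree up to floating-point rounding, while the enumeration costs <math>2^n </math> evaluations and the scaled pass costs one.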
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout multiplies the activations by Bernoulli distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>. This works just as well as, or perhaps better than, Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which is equal to <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. We can generalize this to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2</math> is a hyperparameter to tune.<br />
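
A minimal sketch of this perturbation, using only the Python standard library; the function name <code>gaussian_multiplicative_noise</code> and the example activations are hypothetical.

```python
import random


def gaussian_multiplicative_noise(h, sigma, rng):
    """Perturb each activation h_i to h_i * r', with r' ~ N(1, sigma^2)."""
    return [h_i * rng.gauss(1.0, sigma) for h_i in h]


rng = random.Random(0)
h = [0.5, 1.0, 2.0]
noisy = gaussian_multiplicative_noise(h, sigma=1.0, rng=rng)
# With sigma = 0 the noise is the constant 1 and the activations are unchanged.
unchanged = gaussian_multiplicative_noise(h, sigma=0.0, rng=rng)
```

Because <math>r' </math> has mean 1, the expected value of each perturbed activation equals the unperturbed activation.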
<br />
== Applying dropout to linear regression ==<br />
<br />
Let <math>X \in \mathbb{R}^{N\times D}</math> be a data matrix of N data points and let <math>\mathbf{y}\in \mathbb{R}^N</math> be a vector of targets. Linear regression tries to find a <math>\mathbf{w}\in \mathbb{R}^D</math> that minimizes <math>\parallel \mathbf{y}-X\mathbf{w}\parallel^2</math>.<br />
<br />
When the input <math>X</math> is dropped out such that any input dimension is retained with probability <math>p</math>, the input can be expressed as <math>R*X</math> where <math>R\in \{0,1\}^{N\times D}</math> is a random matrix with <math>R_{ij}\sim Bernoulli(p)</math> and <math>*</math> denotes element-wise product. Marginalizing the noise, the objective function becomes<br />
<br />
<math>\min_{\mathbf{w}} \mathbb{E}_{R\sim Bernoulli(p)}[\parallel \mathbf{y}-(R*X)\mathbf{w}\parallel^2 ]<br />
</math><br />
<br />
This reduces to <br />
<br />
<math>\min_{\mathbf{w}} \parallel \mathbf{y}-pX\mathbf{w}\parallel^2+p(1-p)\parallel \Gamma\mathbf{w}\parallel^2<br />
</math><br />
<br />
where <math>\Gamma=(diag(X^TX))^{\frac{1}{2}}</math>. Therefore, dropout with linear regression is equivalent to ridge regression with a particular form for <math>\Gamma</math>. This form of <math>\Gamma</math> essentially scales the weight cost for weight <math>w_i</math> by the standard deviation of the <math>i</math>th dimension of the data. If a particular data dimension varies a lot, the regularizer tries to squeeze its weight more.<br />
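
The reduction can be checked numerically on a tiny example by enumerating every binary mask <math>R</math> and comparing the exact expectation with the ridge-like closed form. The sketch below is an illustrative check under assumed toy values; the function names are hypothetical.

```python
from itertools import product


def expected_dropout_loss(X, y, w, p):
    """E_R ||y - (R*X) w||^2, computed by enumerating every binary mask R."""
    n, d = len(X), len(X[0])
    total = 0.0
    for bits in product([0, 1], repeat=n * d):
        kept = sum(bits)
        prob = (p ** kept) * ((1 - p) ** (n * d - kept))
        loss = 0.0
        for i in range(n):
            pred = sum(bits[i * d + j] * X[i][j] * w[j] for j in range(d))
            loss += (y[i] - pred) ** 2
        total += prob * loss
    return total


def ridge_form_loss(X, y, w, p):
    """||y - p X w||^2 + p (1 - p) ||Gamma w||^2, Gamma = diag(X^T X)^(1/2)."""
    n, d = len(X), len(X[0])
    fit = sum((y[i] - p * sum(X[i][j] * w[j] for j in range(d))) ** 2
              for i in range(n))
    penalty = sum(sum(X[i][j] ** 2 for i in range(n)) * w[j] ** 2
                  for j in range(d))
    return fit + p * (1 - p) * penalty


# Hypothetical toy problem with N = 2 data points and D = 2 dimensions
X = [[1.0, 2.0], [3.0, -1.0]]
y = [1.0, 0.5]
w = [0.4, -0.3]
lhs = expected_dropout_loss(X, y, w, p=0.6)
rhs = ridge_form_loss(X, y, w, p=0.6)
```

The two values agree up to rounding because each prediction <math>\sum_j R_{ij}X_{ij}w_j</math> has mean <math>p\sum_j X_{ij}w_j</math> and variance <math>p(1-p)\sum_j X_{ij}^2w_j^2</math>, and the expected squared error is the squared bias plus the variance.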
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
In a standard neural network, units may change in a way that fixes up the mistakes of other units, which may lead to complex co-adaptations and overfitting, because these co-adaptations do not generalize to unseen data. Dropout breaks the co-adaptations between hidden units by making the presence of other units unreliable. Figure 7a shows that, without dropout, the hidden units have co-adapted in order to produce good reconstructions; each hidden unit on its own does not detect a meaningful feature. In Figure 7b, with dropout, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b (with dropout) than in Figure 8a (without dropout).<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper examines how the tunable hyperparameter <math>p </math> (the probability of retaining a unit) affects performance. The comparison is done in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
In case 1 the optimal <math>p </math> lies between 0.4 and 0.8, while in case 2 it is around 0.6. The usual default value in practice is 0.5, which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing data set size when dropout is used with feed-forward networks. From Figure 10, dropout does not give any improvement on very small data sets (100 and 500 examples). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors applied dropout to the MNIST data set and compared it with different methods. The MNIST data set consists of 28 × 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. In the result table, the Deep Boltzmann Machine with dropout finetuning performs best, with an error rate of only 0.79%. <br />
<br />
[[File:Result.png]]<br />
<br />
In order to test the robustness of dropout, they did classification experiments with networks of many different architectures keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these different architectures as training progresses. Dropout gives a huge improvement across all architectures.<br />
<br />
[[File:dropout.PNG]]<br />
<br />
The authors also applied dropout to many neural networks and tested them on different data sets, such as Street View House Numbers (SVHN), CIFAR, ImageNet and TIMIT. In these experiments, adding dropout reduced the error rate and improved the performance of the networks.<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique for preventing overfitting in deep neural networks that have a large number of parameters. It can also be extended to Restricted Boltzmann Machines and to other models, e.g. convolutional networks. One drawback of dropout is that it increases training time. This creates a trade-off between overfitting and training time.<br />
<br />
=Reference=<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=proposal_for_STAT946_(Deep_Learning)_final_projects_Fall_2015&diff=26471proposal for STAT946 (Deep Learning) final projects Fall 20152015-11-18T17:48:06Z<p>Arashwan: </p>
<hr />
<div>'''Project 0:''' (This is just an example)<br />
<br />
'''Group members:'''first name family name, first name family name, first name family name<br />
<br />
'''Title:''' Sentiment Analysis on Movie Reviews<br />
<br />
''' Description:''' The idea and data for this project are taken from http://www.kaggle.com/c/sentiment-analysis-on-movie-reviews.<br />
Sentiment analysis is the problem of determining whether a given string contains positive or negative sentiment. For example, “A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story” contains negative sentiment, but it is not immediately clear which parts of the sentence make it so.<br />
This competition seeks to implement machine learning algorithms that can determine the sentiment of a movie review.<br />
<br />
'''Project 1:'''<br />
<br />
'''Group members:''' Sean Aubin, Brent Komer<br />
<br />
'''Title:''' Convolution Neural Networks in SLAM<br />
<br />
''' Description:''' We will try to replicate the results reported in [http://arxiv.org/abs/1411.1509 Convolutional Neural Networks-based Place Recognition] using [http://caffe.berkeleyvision.org/ Caffe] and [http://arxiv.org/abs/1409.4842 Google-net]. As a "stretch" goal, we will try to convert the CNN to a spiking neural network (a technique created by Eric Hunsberger) for greater biological plausibility and easier integration with other cognitive systems using Nengo. This work will help Brent start his PhD, investigating cognitive localisation systems and object manipulation.<br />
<br />
'''Project 2:'''<br />
<br />
'''Group members:''' Xinran Liu, Fatemeh Karimi, Deepak Rishi & Chris Choi<br />
<br />
'''Title:''' Image Classification with Deep Learning<br />
<br />
''' Description:''' Our aim is to participate in the Digital Recognizer Kaggle Challenge, where one has to correctly classify the Modified National Institute of Standards and Technology (MNIST) dataset of handwritten numerical digits. For our first approach we propose using a simple Feed-Forward Neural Network to form a baseline for comparison. We then plan on experimenting on different aspects of a Neural Network such as network architecture, activation functions and incorporate a wide variety of training methods.<br />
<br />
'''Project 3'''<br />
<br />
'''Group members:''' Ri Wang, Maysum Panju, Mahmood Gohari<br />
<br />
'''Title:''' Machine Translation Using Neural Networks<br />
<br />
'''Description:''' The goal of this project is to translate languages using different types of neural networks and the algorithms described in "Sequence to sequence learning with neural networks." and "Neural machine translation by jointly learning to align and translate". Different vector representations for input sentences (word frequency, Word2Vec, etc) will be used and all combinations of algorithms will be ranked in terms of accuracy.<br />
Our data will mainly be from [http://www.statmt.org/europarl/ Europarl] and [https://tatoeba.org/eng Tatoeba]. The common target language will be English to allow for easier judgement of translation quality.<br />
<br />
'''Project 4'''<br />
<br />
'''Group members:''' Peter Blouw, Jan Gosmann<br />
<br />
'''Title:''' Using Structured Representations in Memory Networks to Perform Question Answering<br />
<br />
'''Description:''' Memory networks are machine learning systems that combine memory and inference to perform tasks that involve sophisticated reasoning (see [http://arxiv.org/pdf/1410.3916.pdf here] and [http://arxiv.org/pdf/1502.05698v7.pdf here]). Our goal in this project is to first implement a memory network that replicates prior performance on the bAbI question-answering tasks described in [http://arxiv.org/pdf/1502.05698v7.pdf Weston et al. (2015)]. Then, we hope to improve upon this baseline performance by using more sophisticated representations of the sentences that encode questions being posed to the network. Current implementations often use a bag of words encoding, which throws out important syntactic information that is relevant to determining what a particular question is asking. As such, we will explore the use of things like POS tags, n-gram information, and parse trees to augment memory network performance.<br />
<br />
'''Project 5'''<br />
<br />
'''Group members:''' Anthony Caterini, Tim Tse<br />
<br />
'''Title:''' The Allen AI Science Challenge<br />
<br />
'''Description:''' The goal of this project is to create an artificial intelligence model that can answer multiple-choice questions on a grade 8 science exam, with a success rate better than the best 8th graders. This will involve a deep neural network as the underlying model, to help parse the large amount of information needed to answer these questions. The model should also learn, over time, how to make better answers by acquiring more and more data. This is a Kaggle challenge, and the link to the challenge is [https://www.kaggle.com/c/the-allen-ai-science-challenge here]. The data to produce the model will come from the Kaggle website.<br />
<br />
'''Project 6''' <br />
<br />
'''Group members:''' Valerie Platsko<br />
<br />
'''Title:''' Classification for P300-Speller Using Convolutional Neural Networks <br />
<br />
''' Description:''' The goal of this project is to replicate (and possibly extend) the results in [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5492691 Convolutional Neural Networks for P300 Detection with Application to Brain-Computer Interfaces], which used convolutional neural networks to recognize P300 responses in recorded EEG and additionally to correctly recognize attended targets.(In the P300-Speller application, letters flash in rows and columns, so a single P300 response is associated with multiple potential targets.) The data in the paper came from http://www.bbci.de/competition/iii/ (dataset II), and there is an additional P300 Speller dataset available from [http://www.bbci.de/competition/ii/ a previous version of the competition].<br />
<br />
'''Project 7''' <br />
<br />
'''Group members:''' Amirreza Lashkari, Derek Latremouille, Rui Qiao and Luyao Ruan<br />
<br />
'''Title:''' Right Whale Recognition <br />
<br />
''' Description:''' The goal of this project is to design an automated right whale recognition process using a dataset of aerial photographs of individual whales. To do so, a deep neural network will be applied in order to extract features and classify objects (whales in this problem). This is a Kaggle challenge, and data is also provided by this challenge (see [https://www.kaggle.com/c/noaa-right-whale-recognition here]).<br />
<br />
'''Project 8'''<br />
<br />
'''Group members:''' Abdullah Rashwan and Priyank Jaini<br />
<br />
'''Title:''' Learning the Parameters for Continuous Distribution Sum-Product Networks using Bayesian Moment Matching<br />
<br />
'''Description:''' Sum-Product Networks have generated interest due to their ability to do exact inference in linear time with respect to the size of the network. Parameter learning, however, is still a problem. We have proposed an online Bayesian Moment Matching algorithm to learn the parameters for discrete distributions. In this work, we are extending the algorithm to learn the parameters for continuous distributions as well.</div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26414deep Convolutional Neural Networks For LVCSR2015-11-17T22:03:48Z<p>Arashwan: /* Switchboard */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They outperformed the state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal/spatial variations while reducing translation variances. CNNs are attractive in the area of speech recognition for two reasons. First, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture this type of correlation.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer for which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The input features are 40-dimensional log mel-filter bank coefficients. The size of each hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set. Training stops after the learning rate has been halved 5 times. <br />
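
The halving rule can be sketched as follows. The initial learning rate, the improvement threshold and the validation losses are hypothetical values chosen for illustration; the summary does not state what counts as a sufficient improvement.

```python
def halving_schedule(val_losses, lr=0.1, min_improvement=0.01, max_halvings=5):
    """Return the learning rate used in each iteration and the number of halvings."""
    rates = []
    halvings = 0
    previous = float("inf")
    for loss in val_losses:
        rates.append(lr)
        # halve the rate when the held-out loss did not improve sufficiently
        if previous - loss < min_improvement:
            lr /= 2.0
            halvings += 1
            if halvings == max_halvings:
                break  # training stops after the fifth halving
        previous = loss
    return rates, halvings


# Hypothetical held-out losses, one per training iteration
val_losses = [2.0, 1.5, 1.495, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2]
rates, halvings = halving_schedule(val_losses)
```

In this sketch the ninth loss is never used, because the fifth halving happens in the eighth iteration.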
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. These convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits the ability to add multiple convolutional layers. In this work, weights are shared across the entire feature space while more filters are used - compared to vision - to capture the differences between the low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully-connected layers. The number of parameters of the network is kept constant for a fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
The following features are used to build the table below. WER is used to decide the best set of features.<br />
# Vocal Tract Length Normalization (VTLN)-warping to help map features into a canonical space.<br />
# feature space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), which is the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only along the frequency axis, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
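
Frequency-only pooling can be sketched as below. The sketch assumes non-overlapping pooling windows and a feature map stored as one list of frequency-band activations per time frame; both are assumptions for illustration, and the function name is hypothetical.

```python
def max_pool_frequency(feature_map, pool_size):
    """Max-pool each frame along the frequency axis in non-overlapping bands."""
    pooled = []
    for frame in feature_map:  # one list of frequency-band activations per frame
        pooled.append([max(frame[k:k + pool_size])
                       for k in range(0, len(frame) - pool_size + 1, pool_size)])
    return pooled


# Hypothetical activations: 2 time frames, 6 frequency bands
feature_map = [
    [0.1, 0.7, 0.3, 0.9, 0.2, 0.4],
    [0.5, 0.2, 0.8, 0.1, 0.6, 0.3],
]
pooled = max_pool_frequency(feature_map, pool_size=3)
```

The number of time frames is unchanged, while the six frequency bands of each frame are reduced to two.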
<br />
= Results with Proposed Architecture =<br />
<br />
The architecture described in the previous section is used in the experiments. A 50-hr English Broadcast News (BN) dataset is used for training and the EARS dev04f and rt04 datasets are used for testing. Five different systems are compared, as shown in the following table. The hybrid approach means that either the DNN or the CNN is used to produce the likelihood probabilities for the HMM, while CNN/DNN-based features means that the CNN or DNN is used to produce features for the GMM/HMM system. We can see that the hybrid CNN offers a 15% relative improvement over the GMM-HMM system, and a 3-5% relative improvement over the hybrid DNN. CNN-based features also offer a 5-6% relative improvement over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER for NN Hybrid and Feature-Based Systems.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 18.8<br />
| 18.1<br />
|-<br />
| Hybrid DNN<br />
| 16.3<br />
| 15.8<br />
|-<br />
| DNN-based features<br />
| 16.7<br />
| 16.0<br />
|-<br />
| Hybrid CNN<br />
| 15.8<br />
| 15.0<br />
|-<br />
| CNN-based features<br />
| 15.2<br />
| 15.0<br />
|-<br />
|}<br />
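
Relative improvement here means the reduction in WER divided by the baseline WER. A small sketch of the computation, using WER values copied from the table above; the function and variable names are hypothetical.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction, in percent, of new_wer with respect to baseline_wer."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER values copied from the table above, as (dev04f, rt04)
wer = {
    "Baseline GMM/HMM": (18.8, 18.1),
    "Hybrid DNN": (16.3, 15.8),
    "Hybrid CNN": (15.8, 15.0),
}
cnn_vs_gmm = [relative_improvement(g, c)
              for g, c in zip(wer["Baseline GMM/HMM"], wer["Hybrid CNN"])]
cnn_vs_dnn = [relative_improvement(d, c)
              for d, c in zip(wer["Hybrid DNN"], wer["Hybrid CNN"])]
```

For the hybrid CNN against the hybrid DNN this gives values of roughly 3% on dev04f and 5% on rt04.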
<br />
= Results on Large Tasks =<br />
<br />
After tuning the CNN configuration on a small dataset, the CNN-based features system is tested on two larger datasets.<br />
<br />
== Broadcast News ==<br />
Broadcast News consists of 400 hours of speech data, which was used for training. The DARPA EARS rt04 and dev04f datasets were used for testing. The following table shows that CNN-based features offer a 13-18% relative improvement over the GMM/HMM system and 10-12% over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER on Broadcast News, 400 hrs.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 16.0<br />
| 13.8<br />
|-<br />
| Hybrid DNN<br />
| 15.1<br />
| 13.4<br />
|-<br />
| DNN-based features<br />
| 14.9<br />
| 13.4<br />
|-<br />
| CNN-based features<br />
| 13.1<br />
| 12.0<br />
|-<br />
|}<br />
<br />
== Switchboard ==<br />
<br />
The Switchboard dataset consists of 300 hours of conversational American English telephony data. The Hub5'00 dataset is used as the validation set, while the rt03 set is used for testing. Switchboard (SWB) and Fisher (FSH) are portions of the rt03 set, and the results are reported separately for each. Three systems, shown in the following table, were compared. CNN-based features offer a 13-33% relative improvement over the GMM/HMM system, and a 4-7% relative improvement over the hybrid DNN system. These results show that CNNs are superior to both GMMs and DNNs.<br />
{| class="wikitable"<br />
|+ WER on Switchboard, 300 hrs.<br />
! Model<br />
! Hub5’00 SWB<br />
! rt03 FSH<br />
! rt03 SWB<br />
|-<br />
| Baseline GMM/HMM <br />
| 14.5<br />
| 17.0<br />
| 25.2<br />
|-<br />
| Hybrid DNN<br />
| 12.2<br />
| 14.9<br />
| 23.5<br />
|-<br />
| CNN-based features<br />
| 11.5<br />
| 14.3<br />
| 21.9<br />
|-<br />
|}<br />
<br />
= Conclusions and Discussions =<br />
<br />
In this work, CNNs were explored and shown to be superior to both GMMs and DNNs on a small speech recognition task. CNNs were then used to produce features for the GMMs; this system was tested on larger datasets, where it outperformed both the GMM-based and DNN-based systems.<br />
<br />
The authors set up the experiments without clarifying the following:<br />
# The hybrid CNN was not tested on the larger datasets. The authors did not give a reason for this; it might be due to scalability issues.<br />
# They did not compare against the CNN system proposed by Abdel-Hamid et al. <ref name=convDNN></ref>.<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26413deep Convolutional Neural Networks For LVCSR2015-11-17T21:59:17Z<p>Arashwan: /* Broadcast News */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They outperformed the state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal and spatial variations while reducing translational variance. CNNs are attractive for speech recognition for two reasons. First, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture this type of correlation.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, in which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The input features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set. Training stops after the learning rate has been halved 5 times. <br />
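The learning-rate rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name <code>halving_schedule</code>, the initial rate <code>initial_lr</code>, and the threshold <code>min_gain</code> for a "sufficient" improvement are hypothetical and are not given in the paper.

```python
def halving_schedule(val_losses, initial_lr=0.1, min_gain=0.01, max_halvings=5):
    """Return the learning rate used at each iteration and the number of halvings.

    val_losses: held-out validation loss observed after each iteration.
    initial_lr, min_gain: hypothetical values, not taken from the paper.
    """
    lr = initial_lr
    halvings = 0
    prev_loss = float("inf")
    history = []
    for loss in val_losses:
        history.append(lr)
        # Halve the rate when the improvement over the previous iteration is too small.
        if prev_loss - loss < min_gain:
            lr /= 2.0
            halvings += 1
            if halvings == max_halvings:
                break  # training stops after the fifth halving
        prev_loss = loss
    return history, halvings


history, halvings = halving_schedule([1.0, 0.5, 0.499, 0.498, 0.497, 0.496, 0.495, 0.3])
print(history, halvings)
```

In this example the last validation loss is never reached, because the fifth halving ends training first.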
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. The convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features, hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits the ability to add multiple convolutional layers. In this work, weight sharing is done across the entire feature space while using more filters than is typical in vision, in order to capture the differences between low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully connected layers. The number of network parameters is kept constant for a fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
The following features are used to build the table below. WER is used to decide the best set of features.<br />
# Vocal Tract Length Normalization (VTLN)-warping to help map features into a canonical space.<br />
# Feature-space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), which is the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
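As a rough illustration of the delta and double-delta features listed above, the sketch below takes the plain difference between consecutive frames, as described in the list. The function names <code>delta</code> and <code>add_deltas</code> are hypothetical, and padding the first frame with zeros is an assumption; practical front ends often compute deltas over a regression window instead.

```python
def delta(frames):
    """First-order difference between consecutive frames.

    frames: list of frames, each a list of feature values.
    The first frame has no predecessor, so its delta is set to zeros (an assumption).
    """
    deltas = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        deltas.append([c - p for p, c in zip(prev, cur)])
    return deltas


def add_deltas(frames):
    """Append delta (d) and double-delta (dd) features to every frame."""
    d = delta(frames)
    dd = delta(d)  # double delta is the delta of the delta
    return [f + a + b for f, a, b in zip(frames, d, dd)]


frames = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
print(add_deltas(frames))
```

Each output frame is three times as long as the input frame: the static features followed by d and dd.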
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only in the frequency domain, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
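Max-pooling along frequency only, as used in the pooling experiments above, can be sketched as follows. The function name <code>max_pool_frequency</code> is hypothetical, and the use of non-overlapping windows is an assumption; the summary only states the pooling size.

```python
def max_pool_frequency(feature_map, pool_size):
    """Max-pool each frame along the frequency axis only.

    feature_map: list of time frames, each a list of activations over frequency bands.
    pool_size: number of adjacent frequency bands per pooling window
               (non-overlapping windows are assumed here).
    The time axis is left untouched: one output frame per input frame.
    """
    pooled = []
    for frame in feature_map:
        pooled.append([max(frame[i:i + pool_size])
                       for i in range(0, len(frame), pool_size)])
    return pooled


activations = [[1, 5, 2, 7, 3, 0],
               [4, 4, 9, 1, 1, 2]]
print(max_pool_frequency(activations, 3))
```

With a pooling size of 3, six frequency bands are reduced to two per frame, while the number of frames stays the same.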
<br />
= Results with Proposed Architecture =<br />
<br />
The architecture described in the previous section is used in the experiments. A 50-hr English Broadcast News (BN) dataset is used for training, and the EARS dev04f and rt04 datasets are used for testing. Five different systems are compared, as shown in the following table. The hybrid approach means that either the DNN or the CNN is used to produce the likelihood probabilities for the HMM, while CNN/DNN-based features means that the CNN or DNN is used to produce features for the GMM/HMM system. Hybrid CNN offers 15% relative improvement over the GMM-HMM system and 3-5% relative improvement over Hybrid DNN. CNN-based features offer 5-6% relative improvement over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER for NN Hybrid and Feature-Based Systems.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 18.8<br />
| 18.1<br />
|-<br />
| Hybrid DNN<br />
| 16.3<br />
| 15.8<br />
|-<br />
| DNN-based features<br />
| 16.7<br />
| 16.0<br />
|-<br />
| Hybrid CNN<br />
| 15.8<br />
| 15.0<br />
|-<br />
| CNN-based features<br />
| 15.2<br />
| 15.0<br />
|-<br />
|}<br />
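Relative improvement here means the WER reduction divided by the baseline WER. The short sketch below recomputes it for the hybrid systems from the table above; the helper names <code>relative_improvement</code> and <code>compare</code> are hypothetical, and the WER values are copied from the table.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction of new_wer with respect to baseline_wer, in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER values copied from the table above, as (dev04f, rt04).
wer = {
    "Baseline GMM/HMM": (18.8, 18.1),
    "Hybrid DNN": (16.3, 15.8),
    "Hybrid CNN": (15.8, 15.0),
}


def compare(baseline, system):
    """Relative improvement of `system` over `baseline` on each test set, rounded to one decimal."""
    return tuple(round(relative_improvement(b, s), 1)
                 for b, s in zip(wer[baseline], wer[system]))


print(compare("Hybrid DNN", "Hybrid CNN"))
print(compare("Baseline GMM/HMM", "Hybrid CNN"))
```

The same helper can be applied to any other pair of rows by adding their WER values to the dictionary.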
<br />
= Results on Large Tasks =<br />
<br />
After tuning the CNN configuration on a small dataset, the CNN-based features system is tested on two larger datasets.<br />
<br />
== Broadcast News ==<br />
Broadcast News consists of 400 hours of speech data, which was used for training. The DARPA EARS rt04 and dev04f datasets were used for testing. The following table shows that CNN-based features offer 13-18% relative improvement over the GMM/HMM system and 10-12% over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER on Broadcast News, 400 hrs.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 16.0<br />
| 13.8<br />
|-<br />
| Hybrid DNN<br />
| 15.1<br />
| 13.4<br />
|-<br />
| DNN-based features<br />
| 14.9<br />
| 13.4<br />
|-<br />
| CNN-based features<br />
| 13.1<br />
| 12.0<br />
|-<br />
|}<br />
<br />
== Switchboard ==<br />
{| class="wikitable"<br />
|+ WER on Switchboard, 300 hrs.<br />
! Model<br />
! Hub5’00 SWB<br />
! rt03 FSH<br />
! rt03 SWB<br />
|-<br />
| Baseline GMM/HMM <br />
| 14.5<br />
| 17.0<br />
| 25.2<br />
|-<br />
| Hybrid DNN<br />
| 12.2<br />
| 14.9<br />
| 23.5<br />
|-<br />
| CNN-based features<br />
| 11.5<br />
| 14.3<br />
| 21.9<br />
|-<br />
|}<br />
= Conclusions and Discussions =<br />
<br />
In this work, CNNs were explored and shown to be superior to both GMMs and DNNs on a small speech recognition task. CNNs were then used to produce features for GMMs; this system was tested on larger datasets, where it outperformed both the GMM-based and DNN-based systems.<br />
<br />
The authors set up the experiments without clarifying the following:<br />
# Hybrid CNN wasn't tested on the larger datasets. The authors didn't give a reason for this; it might be due to scalability issues.<br />
# They didn't compare to the CNN system proposed by Abdel-Hamid et al. <ref name=convDNN></ref>.<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26411deep Convolutional Neural Networks For LVCSR2015-11-17T21:52:21Z<p>Arashwan: </p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They have outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems on both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal and spatial variations while reducing translational variance. CNNs are attractive for speech recognition for two reasons. First, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture this type of correlation.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, in which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The input features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set. Training stops after the learning rate has been halved 5 times. <br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. The convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features, hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits the ability to add multiple convolutional layers. In this work, weight sharing is done across the entire feature space while using more filters than is typical in vision, in order to capture the differences between low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully connected layers. The number of network parameters is kept constant for a fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
The following features are used to build the table below. WER is used to decide the best set of features.<br />
# Vocal Tract Length Normalization (VTLN)-warping to help map features into a canonical space.<br />
# Feature-space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), which is the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only in the frequency domain, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
<br />
= Results with Proposed Architecture =<br />
<br />
The architecture described in the previous section is used in the experiments. A 50-hr English Broadcast News (BN) dataset is used for training, and the EARS dev04f and rt04 datasets are used for testing. Five different systems are compared, as shown in the following table. The hybrid approach means that either the DNN or the CNN is used to produce the likelihood probabilities for the HMM, while CNN/DNN-based features means that the CNN or DNN is used to produce features for the GMM/HMM system. Hybrid CNN offers 15% relative improvement over the GMM-HMM system and 3-5% relative improvement over Hybrid DNN. CNN-based features offer 5-6% relative improvement over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER for NN Hybrid and Feature-Based Systems.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 18.8<br />
| 18.1<br />
|-<br />
| Hybrid DNN<br />
| 16.3<br />
| 15.8<br />
|-<br />
| DNN-based features<br />
| 16.7<br />
| 16.0<br />
|-<br />
| Hybrid CNN<br />
| 15.8<br />
| 15.0<br />
|-<br />
| CNN-based features<br />
| 15.2<br />
| 15.0<br />
|-<br />
|}<br />
<br />
= Results on Large Tasks =<br />
<br />
After tuning the CNN configuration on a small dataset, the CNN-based features system is tested on two larger datasets.<br />
<br />
== Broadcast News ==<br />
{| class="wikitable"<br />
|+ WER on Broadcast News, 400 hrs.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 16.0<br />
| 13.8<br />
|-<br />
| Hybrid DNN<br />
| 15.1<br />
| 13.4<br />
|-<br />
| DNN-based features<br />
| 14.9<br />
| 13.4<br />
|-<br />
| CNN-based features<br />
| 13.1<br />
| 12.0<br />
|-<br />
|}<br />
== Switchboard ==<br />
{| class="wikitable"<br />
|+ WER on Switchboard, 300 hrs.<br />
! Model<br />
! Hub5’00 SWB<br />
! rt03 FSH<br />
! rt03 SWB<br />
|-<br />
| Baseline GMM/HMM <br />
| 14.5<br />
| 17.0<br />
| 25.2<br />
|-<br />
| Hybrid DNN<br />
| 12.2<br />
| 14.9<br />
| 23.5<br />
|-<br />
| CNN-based features<br />
| 11.5<br />
| 14.3<br />
| 21.9<br />
|-<br />
|}<br />
= Conclusions and Discussions =<br />
<br />
In this work, CNNs were explored and shown to be superior to both GMMs and DNNs on a small speech recognition task. CNNs were then used to produce features for GMMs; this system was tested on larger datasets, where it outperformed both the GMM-based and DNN-based systems.<br />
<br />
The authors set up the experiments without clarifying the following:<br />
# Hybrid CNN wasn't tested on the larger datasets. The authors didn't give a reason for this; it might be due to scalability issues.<br />
# They didn't compare to the CNN system proposed by Abdel-Hamid et al. <ref name=convDNN></ref>.<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26410deep Convolutional Neural Networks For LVCSR2015-11-17T21:51:51Z<p>Arashwan: /* Conclusions and Discussions */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They have outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems on both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal and spatial variations while reducing translational variance. CNNs are attractive for speech recognition for two reasons. First, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture this type of correlation.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, in which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The input features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set. Training stops after the learning rate has been halved 5 times. <br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. The convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features, hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits the ability to add multiple convolutional layers. In this work, weight sharing is done across the entire feature space while using more filters than is typical in vision, in order to capture the differences between low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully connected layers. The number of network parameters is kept constant for a fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
The following features are used to build the table below. WER is used to decide the best set of features.<br />
# Vocal Tract Length Normalization (VTLN)-warping to help map features into a canonical space.<br />
# Feature-space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), which is the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only in the frequency domain, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
<br />
= Results with Proposed Architecture =<br />
<br />
The architecture described in the previous section is used in the experiments. A 50-hr English Broadcast News (BN) dataset is used for training, and the EARS dev04f and rt04 datasets are used for testing. Five different systems are compared, as shown in the following table. The hybrid approach means that either the DNN or the CNN is used to produce the likelihood probabilities for the HMM, while CNN/DNN-based features means that the CNN or DNN is used to produce features for the GMM/HMM system. Hybrid CNN offers 15% relative improvement over the GMM-HMM system and 3-5% relative improvement over Hybrid DNN. CNN-based features offer 5-6% relative improvement over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER for NN Hybrid and Feature-Based Systems.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 18.8<br />
| 18.1<br />
|-<br />
| Hybrid DNN<br />
| 16.3<br />
| 15.8<br />
|-<br />
| DNN-based features<br />
| 16.7<br />
| 16.0<br />
|-<br />
| Hybrid CNN<br />
| 15.8<br />
| 15.0<br />
|-<br />
| CNN-based features<br />
| 15.2<br />
| 15.0<br />
|-<br />
|}<br />
<br />
= Results on Large Tasks =<br />
<br />
After tuning the CNN configuration on a small dataset, the CNN-based features system is tested on two larger datasets.<br />
<br />
== Broadcast News ==<br />
{| class="wikitable"<br />
|+ WER on Broadcast News, 400 hrs.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 16.0<br />
| 13.8<br />
|-<br />
| Hybrid DNN<br />
| 15.1<br />
| 13.4<br />
|-<br />
| DNN-based features<br />
| 14.9<br />
| 13.4<br />
|-<br />
| CNN-based features<br />
| 13.1<br />
| 12.0<br />
|-<br />
|}<br />
== Switchboard ==<br />
{| class="wikitable"<br />
|+ WER on Switchboard, 300 hrs.<br />
! Model<br />
! Hub5’00 SWB<br />
! rt03 FSH<br />
! rt03 SWB<br />
|-<br />
| Baseline GMM/HMM <br />
| 14.5<br />
| 17.0<br />
| 25.2<br />
|-<br />
| Hybrid DNN<br />
| 12.2<br />
| 14.9<br />
| 23.5<br />
|-<br />
| CNN-based features<br />
| 11.5<br />
| 14.3<br />
| 21.9<br />
|-<br />
|}<br />
= Conclusions and Discussions =<br />
<br />
In this work, CNNs were explored and shown to be superior to both GMMs and DNNs on a small speech recognition task. CNNs were then used to produce features for GMMs; this system was tested on larger datasets, where it outperformed both the GMM-based and DNN-based systems.<br />
<br />
The authors set up the experiments without clarifying the following:<br />
# Hybrid CNN wasn't tested on the larger datasets. The authors didn't give a reason for this; it might be due to scalability issues.<br />
# They didn't compare to the CNN system proposed by Abdel-Hamid et al. [].<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26409deep Convolutional Neural Networks For LVCSR2015-11-17T21:46:28Z<p>Arashwan: /* Conclusions and Discussions */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They have outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems on both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal and spatial variations while reducing translational variance. CNNs are attractive for speech recognition for two reasons. First, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture this type of correlation.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, in which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The input features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set. Training stops after the learning rate has been halved 5 times. <br />
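
The halving rule described above can be sketched as follows. This is only an illustration: the initial learning rate and the threshold for a "sufficient" improvement (<code>min_rel_improvement</code>) are hypothetical values, since the summary does not state them.

```python
def halving_schedule(val_losses, lr=0.1, min_rel_improvement=0.01, max_halvings=5):
    """Simulate the learning-rate schedule over a list of validation losses.

    lr and min_rel_improvement are assumed values, not taken from the paper.
    Returns the learning rate used at each iteration and the number of halvings.
    """
    lrs = []
    halvings = 0
    prev = None
    for loss in val_losses:
        lrs.append(lr)
        # Halve when the relative improvement over the previous iteration is too small.
        if prev is not None and (prev - loss) / prev < min_rel_improvement:
            lr /= 2.0
            halvings += 1
            if halvings == max_halvings:
                break  # training stops after the fifth halving
        prev = loss
    return lrs, halvings


losses = [1.0, 0.8, 0.799, 0.798, 0.797, 0.796, 0.795, 0.5]
rates, n_halvings = halving_schedule(losses)
print(rates, n_halvings)
```

In this toy run the last loss value is never reached, because the fifth halving ends training one iteration earlier.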
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. These convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits adding multiple convolutional layers. In this work, weight sharing is done across the entire feature space, while more filters are used than is typical in vision to capture the differences between low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully connected layers. The number of network parameters is kept constant for a fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
The following features are used to build the table below; WER is used to decide the best set of features.<br />
# Vocal Tract Length Normalization (VTLN) warping, to help map features into a canonical space.<br />
# Feature-space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), which is the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
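
The delta (d) and double-delta (dd) features listed in this section can be illustrated with a short NumPy sketch. It takes the definition literally, as a plain difference between consecutive frames with the first frame's delta set to zero; this is an assumption, and practical front ends often use a regression window over several frames instead. The function name <code>add_deltas</code> is hypothetical.

```python
import numpy as np


def add_deltas(feats):
    """Append delta and double-delta features.

    feats: array of shape (num_frames, num_bands), e.g. log mel filter bank outputs.
    Returns an array of shape (num_frames, 3 * num_bands): [static, d, dd].
    """
    # d[t] = feats[t] - feats[t-1]; the first frame is paired with itself, giving 0.
    d = np.diff(feats, axis=0, prepend=feats[:1])
    # dd is the same difference applied to the delta stream.
    dd = np.diff(d, axis=0, prepend=d[:1])
    return np.concatenate([feats, d, dd], axis=1)


frames = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 6.0]])
print(add_deltas(frames))
```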
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only in the frequency domain, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two different datasets with two different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
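
As a rough illustration of frequency-only pooling, the sketch below applies one shared filter along the frequency axis of a single frame and then max-pools the result. The filter width, the filter values, and the use of non-overlapping pooling windows are assumptions made for the example; they are not taken from the summary above.

```python
import numpy as np


def conv_freq(x, w):
    """Slide one shared filter w along the frequency axis of x (no padding)."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])


def max_pool(x, size):
    """Non-overlapping max pooling; trailing values that do not fill a window are dropped."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)


frame = np.random.default_rng(0).standard_normal(40)  # one frame of 40 filter bank values
filt = np.ones(9) / 9.0                               # hypothetical 9-wide filter
activations = conv_freq(frame, filt)                  # 40 - 9 + 1 = 32 values
pooled = max_pool(activations, 3)                     # 32 // 3 = 10 values
print(activations.shape, pooled.shape)
```

With a pooling size of 3, each output unit keeps the largest activation among three neighbouring frequency positions, so small shifts along frequency leave the pooled output largely unchanged.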
<br />
= Results with Proposed Architecture =<br />
<br />
The architecture described in the previous section is used in the experiments. A 50-hr English Broadcast News (BN) dataset is used for training, and the EARS dev04f and rt04 datasets are used for testing. Five different systems are compared, as shown in the following table. In the hybrid approach, either the DNN or the CNN is used to produce the likelihood probabilities for the HMM, whereas CNN/DNN-based features means that the CNN or DNN is used to produce features for the GMM/HMM system. Hybrid CNN offers 15% relative improvement over the GMM-HMM system, and 3-5% relative improvement over Hybrid DNN. CNN-based features also offer 5-6% relative improvement over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER for NN Hybrid and Feature-Based Systems.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 18.8<br />
| 18.1<br />
|-<br />
| Hybrid DNN<br />
| 16.3<br />
| 15.8<br />
|-<br />
| DNN-based features<br />
| 16.7<br />
| 16.0<br />
|-<br />
| Hybrid CNN<br />
| 15.8<br />
| 15.0<br />
|-<br />
| CNN-based features<br />
| 15.2<br />
| 15.0<br />
|-<br />
|}<br />
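
Relative improvement here means the WER reduction divided by the baseline WER. The sketch below computes it from the entries of the table above; because the table values are rounded, the results may not match the percentages quoted in the text exactly. The helper name <code>relative_improvement</code> is introduced only for this example.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# (dev04f, rt04) WER values copied from the table above.
wer = {
    "Baseline GMM/HMM": (18.8, 18.1),
    "Hybrid DNN": (16.3, 15.8),
    "DNN-based features": (16.7, 16.0),
    "Hybrid CNN": (15.8, 15.0),
    "CNN-based features": (15.2, 15.0),
}

pairs = [
    ("Baseline GMM/HMM", "Hybrid CNN"),
    ("Hybrid DNN", "Hybrid CNN"),
    ("DNN-based features", "CNN-based features"),
]
for base, new in pairs:
    gains = [relative_improvement(b, n) for b, n in zip(wer[base], wer[new])]
    print(f"{new} vs {base}: dev04f {gains[0]:.1f}%, rt04 {gains[1]:.1f}%")
```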
<br />
= Results on Large Tasks =<br />
<br />
After tuning the CNN configuration on a small dataset, the CNN-based features system is tested on two larger datasets.<br />
<br />
== Broadcast News ==<br />
{| class="wikitable"<br />
|+ WER on Broadcast News, 400 hrs.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 16.0<br />
| 13.8<br />
|-<br />
| Hybrid DNN<br />
| 15.1<br />
| 13.4<br />
|-<br />
| DNN-based features<br />
| 14.9<br />
| 13.4<br />
|-<br />
| CNN-based features<br />
| 13.1<br />
| 12.0<br />
|-<br />
|}<br />
== Switchboard ==<br />
{| class="wikitable"<br />
|+ WER on Switchboard, 300 hrs.<br />
! Model<br />
! Hub5’00 SWB<br />
! rt03 FSH<br />
! rt03 SWB<br />
|-<br />
| Baseline GMM/HMM <br />
| 14.5<br />
| 17.0<br />
| 25.2<br />
|-<br />
| Hybrid DNN<br />
| 12.2<br />
| 14.9<br />
| 23.5<br />
|-<br />
| CNN-based features<br />
| 11.5<br />
| 14.3<br />
| 21.9<br />
|-<br />
|}<br />
= Conclusions and Discussions =<br />
<br />
<br />
<br />
# Hybrid CNN wasn't tested on the larger datasets. The authors didn't give a reason for that; it might be due to scalability issues.<br />
# They didn't compare to the CNN system proposed by Abdel-Hamid et al. <ref name=convDNN></ref>.<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26408deep Convolutional Neural Networks For LVCSR2015-11-17T21:45:28Z<p>Arashwan: </p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They have outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal and spatial variations while reducing translation variance. CNNs are attractive for speech recognition for two reasons. First, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture this type of correlation.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, in which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The input features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set. Training stops after the learning rate has been halved 5 times. <br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. These convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits adding multiple convolutional layers. In this work, weight sharing is done across the entire feature space, while more filters are used than is typical in vision to capture the differences between low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully connected layers. The number of network parameters is kept constant for a fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
The following features are used to build the table below; WER is used to decide the best set of features.<br />
# Vocal Tract Length Normalization (VTLN) warping, to help map features into a canonical space.<br />
# Feature-space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), which is the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only in the frequency domain, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two different datasets with two different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
<br />
= Results with Proposed Architecture =<br />
<br />
The architecture described in the previous section is used in the experiments. A 50-hr English Broadcast News (BN) dataset is used for training, and the EARS dev04f and rt04 datasets are used for testing. Five different systems are compared, as shown in the following table. In the hybrid approach, either the DNN or the CNN is used to produce the likelihood probabilities for the HMM, whereas CNN/DNN-based features means that the CNN or DNN is used to produce features for the GMM/HMM system. Hybrid CNN offers 15% relative improvement over the GMM-HMM system, and 3-5% relative improvement over Hybrid DNN. CNN-based features also offer 5-6% relative improvement over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER for NN Hybrid and Feature-Based Systems.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 18.8<br />
| 18.1<br />
|-<br />
| Hybrid DNN<br />
| 16.3<br />
| 15.8<br />
|-<br />
| DNN-based features<br />
| 16.7<br />
| 16.0<br />
|-<br />
| Hybrid CNN<br />
| 15.8<br />
| 15.0<br />
|-<br />
| CNN-based features<br />
| 15.2<br />
| 15.0<br />
|-<br />
|}<br />
<br />
= Results on Large Tasks =<br />
<br />
After tuning the CNN configuration on a small dataset, the CNN-based features system is tested on two larger datasets.<br />
<br />
== Broadcast News ==<br />
{| class="wikitable"<br />
|+ WER on Broadcast News, 400 hrs.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 16.0<br />
| 13.8<br />
|-<br />
| Hybrid DNN<br />
| 15.1<br />
| 13.4<br />
|-<br />
| DNN-based features<br />
| 14.9<br />
| 13.4<br />
|-<br />
| CNN-based features<br />
| 13.1<br />
| 12.0<br />
|-<br />
|}<br />
== Switchboard ==<br />
{| class="wikitable"<br />
|+ WER on Switchboard, 300 hrs.<br />
! Model<br />
! Hub5’00 SWB<br />
! rt03 FSH<br />
! rt03 SWB<br />
|-<br />
| Baseline GMM/HMM <br />
| 14.5<br />
| 17.0<br />
| 25.2<br />
|-<br />
| Hybrid DNN<br />
| 12.2<br />
| 14.9<br />
| 23.5<br />
|-<br />
| CNN-based features<br />
| 11.5<br />
| 14.3<br />
| 21.9<br />
|-<br />
|}<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26407deep Convolutional Neural Networks For LVCSR2015-11-17T21:39:57Z<p>Arashwan: /* Conclusions and Discussions */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They have outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal and spatial variations while reducing translation variance. CNNs are attractive for speech recognition for two reasons. First, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture this type of correlation.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, in which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The input features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set. Training stops after the learning rate has been halved 5 times. <br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. These convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits adding multiple convolutional layers. In this work, weight sharing is done across the entire feature space, while more filters are used than is typical in vision to capture the differences between low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully connected layers. The number of network parameters is kept constant for a fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
The following features are used to build the table below; WER is used to decide the best set of features.<br />
# Vocal Tract Length Normalization (VTLN) warping, to help map features into a canonical space.<br />
# Feature-space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), which is the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only in the frequency domain, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two different datasets with two different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
<br />
= Results with Proposed Architecture =<br />
<br />
The architecture described in the previous section is used in the experiments. A 50-hr English Broadcast News (BN) dataset is used for training, and the EARS dev04f and rt04 datasets are used for testing. Five different systems are compared, as shown in the following table. In the hybrid approach, either the DNN or the CNN is used to produce the likelihood probabilities for the HMM, whereas CNN/DNN-based features means that the CNN or DNN is used to produce features for the GMM/HMM system. Hybrid CNN offers 15% relative improvement over the GMM-HMM system, and 3-5% relative improvement over Hybrid DNN. CNN-based features also offer 5-6% relative improvement over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER for NN Hybrid and Feature-Based Systems.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 18.8<br />
| 18.1<br />
|-<br />
| Hybrid DNN<br />
| 16.3<br />
| 15.8<br />
|-<br />
| DNN-based features<br />
| 16.7<br />
| 16.0<br />
|-<br />
| Hybrid CNN<br />
| 15.8<br />
| 15.0<br />
|-<br />
| CNN-based features<br />
| 15.2<br />
| 15.0<br />
|-<br />
|}<br />
<br />
= Results on Large Tasks =<br />
<br />
= Conclusions and Discussions =<br />
<br />
Hybrid CNN wasn't tested on the larger datasets. The authors didn't give a reason for that; it might be due to scalability issues.<br />
They didn't compare to the CNN system proposed by Abdel-Hamid et al. <ref name=convDNN></ref>.<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26406deep Convolutional Neural Networks For LVCSR2015-11-17T21:39:02Z<p>Arashwan: /* Conclusions and Discussions */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They have outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal and spatial variations while reducing translation variance. CNNs are attractive for speech recognition for two reasons. First, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture this type of correlation.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, in which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The input features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set. Training stops after the learning rate has been halved 5 times. <br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. These convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits adding multiple convolutional layers. In this work, weight sharing is done across the entire feature space, while more filters are used than is typical in vision to capture the differences between low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully connected layers. The number of network parameters is kept constant for a fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
The following features are used to build the table below; WER is used to decide the best set of features.<br />
# Vocal Tract Length Normalization (VTLN) warping, to help map features into a canonical space.<br />
# Feature-space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), which is the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only along the frequency axis, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
<br />
= Results with Proposed Architecture =<br />
<br />
The architecture described in the previous section is used in the experiments. A 50-hr English Broadcast News (BN) dataset is used for training, and the EARS dev04f and rt04 datasets are used for testing. Five different systems are compared, as shown in the following table. The hybrid approach means that either the DNN or the CNN is used to produce the likelihood probabilities for the HMM, while CNN/DNN-based features means that the CNN or DNN is used to produce features for the GMM/HMM system. The hybrid CNN offers a 15% relative improvement over the GMM/HMM system and a 3-5% relative improvement over the hybrid DNN. The CNN-based features also offer a 5-6% relative improvement over the DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER for NN Hybrid and Feature-Based Systems.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 18.8<br />
| 18.1<br />
|-<br />
| Hybrid DNN<br />
| 16.3<br />
| 15.8<br />
|-<br />
| DNN-based features<br />
| 16.7<br />
| 16.0<br />
|-<br />
| Hybrid CNN<br />
| 15.8<br />
| 15.0<br />
|-<br />
| CNN-based features<br />
| 15.2<br />
| 15.0<br />
|-<br />
|}<br />
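<br />
The relative improvements can be recomputed from the table as the reduction in WER divided by the baseline WER. The following is a minimal sketch; the dictionary and function names are illustrative, and the values are copied from the table above. Figures computed this way may differ slightly from the rounded ranges quoted in the text.<br />

```python
# WER values (dev04f, rt04) copied from the table above.
wer = {
    "Baseline GMM/HMM": (18.8, 18.1),
    "Hybrid DNN": (16.3, 15.8),
    "DNN-based features": (16.7, 16.0),
    "Hybrid CNN": (15.8, 15.0),
    "CNN-based features": (15.2, 15.0),
}


def relative_improvement(baseline, system):
    """Relative WER reduction of `system` over `baseline`, in percent."""
    return 100.0 * (baseline - system) / baseline


def compare(baseline_name, system_name):
    """Relative improvement on (dev04f, rt04), rounded to one decimal."""
    return tuple(
        round(relative_improvement(b, s), 1)
        for b, s in zip(wer[baseline_name], wer[system_name])
    )


if __name__ == "__main__":
    for base, system in [
        ("Baseline GMM/HMM", "Hybrid CNN"),
        ("Hybrid DNN", "Hybrid CNN"),
        ("DNN-based features", "CNN-based features"),
    ]:
        print(system, "vs", base, compare(base, system))
```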
<br />
= Results on Large Tasks =<br />
<br />
= Conclusions and Discussions =<br />
<br />
The hybrid CNN was not tested on a larger dataset. The authors did not give a reason for this; it might be due to scalability issues.<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26405deep Convolutional Neural Networks For LVCSR2015-11-17T21:19:05Z<p>Arashwan: /* Results with Proposed Architecture */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They have outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal and spatial variations while reducing translational variance. CNNs are attractive for speech recognition for two reasons. First, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture this type of correlation.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, in which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The input features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set. Training stops after the learning rate has been halved 5 times.<br />
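<br />
The halving rule described above can be sketched as follows. This is only an illustration: the initial learning rate, the improvement threshold, and the function name are assumptions, since the summary does not give the values used by the authors.<br />

```python
def halving_schedule(val_losses, initial_lr=1.0, min_improvement=0.05, max_halvings=5):
    """Return the learning rate used at each iteration.

    val_losses: held-out objective value after each iteration (lower is better).
    initial_lr, min_improvement: hypothetical values chosen for illustration.
    """
    lr = initial_lr
    halvings = 0
    previous = float("inf")
    rates = []
    for loss in val_losses:
        rates.append(lr)
        # Halve the rate when the held-out objective did not improve enough.
        if previous - loss < min_improvement:
            lr /= 2.0
            halvings += 1
            if halvings == max_halvings:
                break  # training stops after the fifth halving
        previous = loss
    return rates


if __name__ == "__main__":
    losses = [1.0, 0.8, 0.79, 0.795, 0.6, 0.6, 0.6, 0.6, 0.6]
    print(halving_schedule(losses))
```

In this sketch the last value in `losses` is never used, because the fifth halving ends training one iteration earlier.<br />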
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before fully connected layers. These convolutional layers tend to reduce spectral variation, while fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
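<br />
For readers unfamiliar with the metric, WER is conventionally the number of word substitutions, deletions and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words. The sketch below computes it with a standard edit-distance recursion; the function name and the example sentences are illustrative and are not taken from the paper.<br />

```python
def word_error_rate(reference, hypothesis):
    """WER in percent between two whitespace-tokenized transcripts.

    Assumes the reference contains at least one word.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                       # deletion
                cur[j - 1] + 1,                    # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```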
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits adding multiple convolutional layers. In this work, weight sharing is done across the entire feature space, while more filters are used than in vision tasks to capture the differences between low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully connected layers. The number of parameters of the network is kept constant for fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
The following features are used to build the table below; WER is used to decide the best set of features.<br />
# Vocal Tract Length Normalization (VTLN)-warping to help map features into a canonical space.<br />
# Feature-space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), which is the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
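<br />
The delta and double-delta features in the list above can be sketched as a plain frame-to-frame difference, applied once for d and twice for dd. This is a simplified illustration: practical front ends often compute deltas over a regression window, and the function names and the zero padding of the first frame are assumptions.<br />

```python
def frame_deltas(frames):
    """Difference between each frame and the previous one.

    frames: list of frames, each a list of filter-bank values.
    The first frame has no predecessor, so its delta is set to zero.
    """
    deltas = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        deltas.append([c - p for p, c in zip(prev, cur)])
    return deltas


def add_deltas(frames):
    """Append delta (d) and double-delta (dd) values to every frame."""
    d = frame_deltas(frames)
    dd = frame_deltas(d)
    return [f + a + b for f, a, b in zip(frames, d, dd)]


if __name__ == "__main__":
    frames = [[0.0, 0.0], [1.0, 2.0], [3.0, 6.0]]
    for row in add_deltas(frames):
        print(row)
```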
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only along the frequency axis, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
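<br />
Max pooling along the frequency axis only can be sketched as below. The time axis is left untouched and each frame is reduced over groups of adjacent frequency bins. Non-overlapping windows and the function name are assumptions made for illustration.<br />

```python
def max_pool_frequency(feature_map, pool_size=3):
    """Max-pool each time frame over non-overlapping frequency windows.

    feature_map: list of time frames, each a list of activations ordered
    by frequency. A final window shorter than pool_size is pooled as is.
    """
    pooled = []
    for frame in feature_map:
        pooled.append([
            max(frame[i:i + pool_size])
            for i in range(0, len(frame), pool_size)
        ])
    return pooled


if __name__ == "__main__":
    activations = [
        [1, 5, 2, 0, 3, 4, 7],
        [2, 2, 9, 1, 1, 1, 0],
    ]
    print(max_pool_frequency(activations, pool_size=3))
```

The number of time frames is unchanged, while the number of frequency positions shrinks by roughly a factor of `pool_size`.<br />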
<br />
= Results with Proposed Architecture =<br />
<br />
The architecture described in the previous section is used in the experiments. A 50-hr English Broadcast News (BN) dataset is used for training, and the EARS dev04f and rt04 datasets are used for testing. Five different systems are compared, as shown in the following table. The hybrid approach means that either the DNN or the CNN is used to produce the likelihood probabilities for the HMM, while CNN/DNN-based features means that the CNN or DNN is used to produce features for the GMM/HMM system. The hybrid CNN offers a 15% relative improvement over the GMM/HMM system and a 3-5% relative improvement over the hybrid DNN. The CNN-based features also offer a 5-6% relative improvement over the DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER for NN Hybrid and Feature-Based Systems.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 18.8<br />
| 18.1<br />
|-<br />
| Hybrid DNN<br />
| 16.3<br />
| 15.8<br />
|-<br />
| DNN-based features<br />
| 16.7<br />
| 16.0<br />
|-<br />
| Hybrid CNN<br />
| 15.8<br />
| 15.0<br />
|-<br />
| CNN-based features<br />
| 15.2<br />
| 15.0<br />
|-<br />
|}<br />
<br />
= Results on Large Tasks =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26400deep Convolutional Neural Networks For LVCSR2015-11-17T20:22:16Z<p>Arashwan: </p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They have outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal and spatial variations while reducing translational variance. CNNs are attractive for speech recognition for two reasons. First, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture this type of correlation.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, in which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The input features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set. Training stops after the learning rate has been halved 5 times.<br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before fully connected layers. These convolutional layers tend to reduce spectral variation, while fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits adding multiple convolutional layers. In this work, weight sharing is done across the entire feature space, while more filters are used than in vision tasks to capture the differences between low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully connected layers. The number of parameters of the network is kept constant for fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
The following features are used to build the table below; WER is used to decide the best set of features.<br />
# Vocal Tract Length Normalization (VTLN)-warping to help map features into a canonical space.<br />
# Feature-space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), which is the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only along the frequency axis, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
<br />
= Results with Proposed Architecture =<br />
= Results on Large Tasks =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26369deep Convolutional Neural Networks For LVCSR2015-11-17T03:38:39Z<p>Arashwan: /* Optimal Feature Set */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They have outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal and spatial variations while reducing translational variance. CNNs are attractive for speech recognition for two reasons. First, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture this type of correlation.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, in which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The input features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set. Training stops after the learning rate has been halved 5 times.<br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before fully connected layers. These convolutional layers tend to reduce spectral variation, while fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits adding multiple convolutional layers. In this work, weight sharing is done across the entire feature space, while more filters are used than in vision tasks to capture the differences between low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully connected layers. The number of parameters of the network is kept constant for fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
The following features are used to build the table below; WER is used to decide the best set of features.<br />
# Vocal Tract Length Normalization (VTLN)-warping to help map features into a canonical space.<br />
# Feature-space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), which is the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only along the frequency axis, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
= Results =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26366deep Convolutional Neural Networks For LVCSR2015-11-17T03:31:54Z<p>Arashwan: </p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They have outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal and spatial variations while reducing translational variance. CNNs are attractive for speech recognition for two reasons. First, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture this type of correlation.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, in which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The input features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set. Training stops after the learning rate has been halved 5 times.<br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before fully connected layers. These convolutional layers tend to reduce spectral variation, while fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits adding multiple convolutional layers. In this work, weight sharing is done across the entire feature space, while more filters are used than in vision tasks to capture the differences between low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully connected layers. The number of parameters of the network is kept constant for fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only along the frequency axis, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
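The frequency-only pooling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name <code>max_pool_frequency</code>, the list-of-lists layout (time frames by frequency bands), and the use of non-overlapping windows are all assumptions made for the example.

```python
def max_pool_frequency(feature_map, pool_size=3):
    """Max-pool each time frame along the frequency axis only.

    feature_map: list of time frames, each a list of activations
    ordered by frequency band (hypothetical layout for this sketch).
    Windows are assumed to be non-overlapping; a trailing window
    shorter than pool_size is pooled as-is.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    pooled = []
    for frame in feature_map:
        # Step through the frequency bands in blocks of pool_size
        pooled.append([
            max(frame[start:start + pool_size])
            for start in range(0, len(frame), pool_size)
        ])
    return pooled


frames = [[1, 5, 2, 0, 3, 4, 9], [7, 0, 0, 2, 2, 8, 1]]
print(max_pool_frequency(frames, pool_size=3))
```

The time axis is left untouched, so the number of frames is unchanged; only the number of frequency bands per frame shrinks.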
= Results =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26363deep Convolutional Neural Networks For LVCSR2015-11-17T03:21:55Z<p>Arashwan: </p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They outperformed the state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal/spatial variations while reducing translation variance. CNNs are attractive in the area of speech recognition for two reasons: first, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture these types of correlations.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, for which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layers is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration for which the objective function doesn't improve sufficiently on a held-out validation set. After the learning rate has been halved 5 times, training stops.<br />
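The learning-rate schedule described above can be sketched roughly as follows. The function name <code>halving_schedule</code>, the threshold <code>min_improvement</code>, and the initial learning rate are hypothetical; the text only says the objective must improve "sufficiently", so the concrete threshold here is an assumption.

```python
def halving_schedule(validation_losses, initial_lr=1.0,
                     min_improvement=0.01, max_halvings=5):
    """Replay a sequence of held-out losses and return the learning
    rate used at each iteration plus the number of halvings.

    An iteration counts as insufficient when the loss drops by less
    than min_improvement relative to the previous iteration
    (assumed criterion; the source does not give the threshold).
    """
    lr = initial_lr
    halvings = 0
    previous = float("inf")
    rates = []
    for loss in validation_losses:
        rates.append(lr)            # rate in effect for this iteration
        if previous - loss < min_improvement:
            lr /= 2.0               # halve after an insufficient step
            halvings += 1
            if halvings >= max_halvings:
                break               # training stops after 5 halvings
        previous = loss
    return rates, halvings


rates, halvings = halving_schedule([1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
print(rates, halvings)
```

In a real training loop the losses would be computed one iteration at a time rather than supplied up front; replaying a list simply keeps the sketch self-contained.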
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. These convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits adding multiple convolutional layers. In this work, weight sharing is done across the entire feature space while using more filters than is typical in vision to capture the differences between the low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers for the configuration with 2 convolutional and 4 fully connected layers. The number of parameters in the network is kept constant for a fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only along the frequency axis, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
= Results =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26362deep Convolutional Neural Networks For LVCSR2015-11-17T03:10:26Z<p>Arashwan: /* Number of Hidden Units */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They outperformed the state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal/spatial variations while reducing translation variance. CNNs are attractive in the area of speech recognition for two reasons: first, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture these types of correlations.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, for which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layers is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration for which the objective function doesn't improve sufficiently on a held-out validation set. After the learning rate has been halved 5 times, training stops.<br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. These convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different features; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed sharing weights across nearby frequencies only. Although this solves the problem, it limits adding multiple convolutional layers. In this work, weight sharing is done across the entire feature space while using more filters than is typical in vision to capture the differences between the low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers for the configuration with 2 convolutional and 4 fully connected layers. The number of parameters in the network is kept constant for a fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
== Optimal Feature Set ==<br />
<br />
== Pooling Experiments ==<br />
<br />
= Results =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26358deep Convolutional Neural Networks For LVCSR2015-11-17T02:48:54Z<p>Arashwan: /* Number of Convolutional vs. Fully Connected Layers */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They outperformed the state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal/spatial variations while reducing translation variance. CNNs are attractive in the area of speech recognition for two reasons: first, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture these types of correlations.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, for which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layers is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration for which the objective function doesn't improve sufficiently on a held-out validation set. After the learning rate has been halved 5 times, training stops.<br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. These convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
<br />
== Optimal Feature Set ==<br />
<br />
== Pooling Experiments ==<br />
<br />
= Results =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26357deep Convolutional Neural Networks For LVCSR2015-11-17T02:47:55Z<p>Arashwan: /* Number of Convolutional vs. Fully Connected Layers */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They outperformed the state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal/spatial variations while reducing translation variance. CNNs are attractive in the area of speech recognition for two reasons: first, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture these types of correlations.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, for which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layers is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration for which the objective function doesn't improve sufficiently on a held-out validation set. After the learning rate has been halved 5 times, training stops.<br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. These convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what has been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
<br />
== Optimal Feature Set ==<br />
<br />
== Pooling Experiments ==<br />
<br />
= Results =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26342deep Convolutional Neural Networks For LVCSR2015-11-17T02:10:48Z<p>Arashwan: /* CNN Architecture */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They outperformed the state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal/spatial variations while reducing translation variance. CNNs are attractive in the area of speech recognition for two reasons: first, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture these types of correlations.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, for which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to learn the behaviour of CNNs for speech tasks. The results are reported on the EARS dev04f dataset. The features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layers is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration for which the objective function doesn't improve sufficiently on a held-out validation set. After the learning rate has been halved 5 times, training stops.<br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
<br />
== Number of Hidden Units ==<br />
<br />
== Optimal Feature Set ==<br />
<br />
== Pooling Experiments ==<br />
<br />
= Results =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Convnets.png&diff=26330File:Convnets.png2015-11-17T02:02:13Z<p>Arashwan: convolutional neural networks</p>
<hr />
<div>convolutional neural networks</div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26285deep Convolutional Neural Networks For LVCSR2015-11-16T19:31:38Z<p>Arashwan: /* Introduction */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They outperformed the state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal/spatial variations while reducing translation variance. CNNs are attractive in the area of speech recognition for two reasons: first, they are translation invariant, which makes them an alternative to various speaker adaptation techniques. Second, the spectral representation of speech has strong local correlations, and CNNs can naturally capture these types of correlations.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
== Experimental Setup ==<br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
<br />
== Number of Hidden Units ==<br />
<br />
== Optimal Feature Set ==<br />
<br />
== Pooling Experiments ==<br />
<br />
= Results =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=26270deep Convolutional Neural Networks For LVCSR2015-11-16T15:57:57Z<p>Arashwan: Created page with "= Introduction = = CNN Architecture = == Experimental Setup == == Number of Convolutional vs. Fully Connected Layers == == Number of Hidden Units == == Optimal Feature Set =..."</p>
<hr />
<div>= Introduction =<br />
<br />
= CNN Architecture =<br />
<br />
== Experimental Setup ==<br />
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
<br />
== Number of Hidden Units ==<br />
<br />
== Optimal Feature Set ==<br />
<br />
== Pooling Experiments ==<br />
<br />
= Results =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=26106deep neural networks for acoustic modeling in speech recognition2015-11-11T17:44:40Z<p>Arashwan: /* Summary for the Main Results for DNN Acoustic Models on Large Data Sets */</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model the temporal variations and Gaussian mixture models (GMMs) to determine the likelihood of each state of each HMM given an acoustic observation. The speech signal in the classical systems is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw speech signal. Although GMMs are quite flexible and fairly easy to train using the Expectation Maximization (EM) algorithm, they are inefficient when modelling data that lie close to a nonlinear manifold, which is the case for speech data. Deep Neural Networks (DNNs) don't suffer from the same shortcoming, hence they can learn much better models than GMMs. Over the past few years, training DNNs has become possible thanks to advancements in machine learning and computer hardware, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small and large vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
DNNs are feed-forward neural networks that have multiple hidden layers. The last layer is a softmax layer, which gives the class probabilities. The weights are learnt using the backpropagation algorithm; it was found empirically that computing the gradient on small random mini-batches is more efficient. To avoid overfitting, early stopping is used: training stops when the accuracy on a validation set starts to decrease. Pretraining is essential when the amount of training data is small. Restricted Boltzmann Machines (RBMs) are used for pretraining, except for the first layer, which uses a Gaussian-Bernoulli RBM (GRBM) because the input is real-valued.<br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
The HMM requires the likelihoods of the observations <math>p(AcousticInput|HMMstate)</math> for running the forward-backward algorithm or for computing a Viterbi alignment. DNNs output the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to scaled likelihoods by dividing them by <math>p(HMMstate)</math>, where <math>p(HMMstate)</math> is the frequency of each HMM state in the training data. By Bayes' rule the result equals the likelihood up to the factor <math>p(AcousticInput)</math>, which is the same for all states and therefore does not affect the alignment. The conversion from posteriors to likelihoods is important when the training labels are highly unbalanced.<br />
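<br />
The conversion can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name <code>scaled_log_likelihoods</code>, the toy posteriors, the state counts and the small floor on the priors are all hypothetical and are not taken from the paper.<br />

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_counts, floor=1e-8):
    """Turn DNN log-posteriors log p(state | input) into scaled
    log-likelihoods log p(input | state) + const by subtracting log priors."""
    counts = np.asarray(state_counts, dtype=float)
    priors = counts / counts.sum()                    # p(state) from training-label frequencies
    log_priors = np.log(np.maximum(priors, floor))    # floor avoids log(0) for unseen states
    return np.asarray(log_posteriors) - log_priors    # broadcasts over frames

# Toy example (hypothetical numbers): 2 frames, 3 HMM states,
# with state 0 heavily over-represented in the training labels.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])
state_counts = [800, 150, 50]
scaled = scaled_log_likelihoods(np.log(posteriors), state_counts)
```

In this toy case the first frame's most probable state under the posterior is state 0, but after dividing by the priors the rare state 2 has the largest scaled likelihood, which is why the conversion matters for unbalanced labels.<br />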
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. On this corpus, DNN-HMM systems outperformed the classical GMM-HMM systems. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>, who reported a significant improvement in accuracy over the state-of-the-art GMM-HMM systems of that time. It was found that the structure of the DNN (i.e. the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus more on tuning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they provide uncorrelated features, which avoids the need for full-covariance GMMs. However, some acoustic information is lost when MFCCs are used. DNNs, on the other hand, can work with correlated features, which opened the door to using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. It was found that using filter-bank features with DNNs improved the accuracy by 1.7% <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy, i.e. the log of the posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math> and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the sequence-level conditional probability <math>p(l_{1:T}|v_{1:T})</math>; this is done for the softmax layer only, keeping the parameters of the hidden layers <math>h</math> fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
where <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes the value one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with that transition feature, <math>\lambda</math> are the weights of the softmax layer, and <math>Z(h_{1:T})</math> is the normalizer obtained by summing the numerator over all label sequences. <math>\gamma</math> and <math>\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set<br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
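<br />
To make the sequence-level probability concrete, the sketch below evaluates <math>\log p(l_{1:T}|h_{1:T})</math> for a given label sequence, computing <math>\log Z(h_{1:T})</math> with a forward recursion in the log domain. It is a minimal illustration, not the authors' implementation: the function names, array shapes and random toy parameters are assumptions, and the transition term is assumed to start at <math>t=2</math> (no special start state).<br />

```python
import itertools
import numpy as np

def log_sum_exp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def sequence_log_prob(labels, h, lam, gamma):
    """log p(l_{1:T} | h_{1:T}) for labels (length T), top-layer activations
    h (T, D), softmax weights lam (L, D) and transition weights gamma (L, L)."""
    emit = h @ lam.T                       # emit[t, l] = sum_d lam[l, d] * h[t, d]
    T = len(labels)
    # Unnormalised score of the given label path (the numerator's exponent).
    score = emit[np.arange(T), labels].sum()
    score += sum(gamma[labels[t - 1], labels[t]] for t in range(1, T))
    # log Z by the forward recursion: alpha[j] sums over all paths ending in j.
    alpha = emit[0]
    for t in range(1, T):
        alpha = emit[t] + log_sum_exp(alpha[:, None] + gamma, axis=0)
    return score - log_sum_exp(alpha, axis=0)

# Sanity check on toy data: the probabilities of all L**T label paths sum to one.
rng = np.random.default_rng(0)
T, D, L = 4, 5, 3
h = rng.normal(size=(T, D))
lam = rng.normal(size=(L, D))
gamma = rng.normal(size=(L, L))
total = sum(np.exp(sequence_log_prob(list(path), h, lam, gamma))
            for path in itertools.product(range(L), repeat=T))
```

The forward recursion costs on the order of <math>T L^2</math> operations, whereas the brute-force sum used for the sanity check grows as <math>L^T</math> and is only feasible for toy sizes.<br />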
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
Convolutional DNNs were introduced in 2009 and were applied to various audio tasks, including the TIMIT dataset <ref><br />
H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for<br />
audio classification using convolutional deep belief networks,” in Advances in<br />
Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J.<br />
Lafferty, C. K. I. Williams, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2009,<br />
pp. 1096–1104.<br />
</ref>. In that work, convolution was applied along the temporal dimension in order to extract the same features at different times. Since temporal variation is already handled by the HMM, Abdel-Hamid et al. proposed applying convolution along the frequency axis instead <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>. Weight sharing and max-pooling were used only across nearby frequencies because acoustic features at different frequencies are very different. They achieved a phone error rate of 20% on the TIMIT dataset, the lowest reported at that time.<br />
<br />
== DNNs and GMMs ==<br />
As the following table shows, monophone DNNs can outperform the best triphone GMM-HMM by 1.7%. Possible reasons for this success are:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, in which each parameter applies to only a small amount of the data.<br />
# A GMM assumes that each data point is generated by a single component, which makes it inefficient at modelling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and it also allows them to analyse a larger window of the signal at each time step.<br />
{| class="wikitable"<br />
|+ Comparison of reported speaker-independent (SI) phonetic error rate (PER) results on the TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application of recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs On Fbank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of DNN-HMM systems on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large-vocabulary tasks, and they outperformed GMM-HMM systems on every task.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
The Bing Voice Search task uses 24 hrs of speech data with different sources of acoustic variation such as noise, music, side-speech, accents, sloppy pronunciation, interruptions, and differences between mobile phones. The DNN-HMM system was based on the DNN that worked well for TIMIT. It contained 5 hidden layers of 2048 units each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69.6% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the amount of training data to 48 hrs, achieving a sentence accuracy of 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
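<br />
The 11-frame input window described above can be sketched as follows. This is an illustrative assumption about the data layout rather than the authors' code: the helper name <code>stack_context</code>, the 40 features per frame, and the choice to pad utterance edges by repeating the first and last frames are all hypothetical.<br />

```python
import numpy as np

def stack_context(frames, left=5, right=5):
    """frames: (T, F) array of per-frame features.
    Returns (T, (left + right + 1) * F): row t is the concatenation of frames
    t-left .. t+right, so the DNN sees context when classifying frame t."""
    T, F = frames.shape
    padded = np.concatenate([
        np.repeat(frames[:1], left, axis=0),     # repeat first frame (assumed edge handling)
        frames,
        np.repeat(frames[-1:], right, axis=0),   # repeat last frame (assumed edge handling)
    ], axis=0)
    width = left + right + 1
    return np.stack([padded[t:t + width].reshape(-1) for t in range(T)])

# Toy utterance: 20 frames with 40 features each (illustrative sizes).
frames = np.arange(20 * 40, dtype=float).reshape(20, 40)
windows = stack_context(frames)   # 5 + 1 + 5 = 11 frames per window
```

Each row of <code>windows</code> is one DNN input vector, and its label is the HMM state of the middle frame of that window.<br />
<br />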
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech data, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN used contains 7 hidden layers of 2048 units each. A trigram language model was used, which was trained on a 2000-hr speech corpus.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref>.<br />
The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training as the DNN system.<br />
{| class="wikitable"<br />
|+ Comparison of five different DBN-DNN acoustic models with two strong, discriminatively trained GMM-HMM baseline systems. "40 MIX" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights that are below a certain threshold to zero. To further improve the accuracy, the DNN-HMM and GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using the DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
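<br />
The sparsification step mentioned above can be illustrated with a short sketch. It assumes the threshold is applied to the magnitude of each weight; that interpretation, the function name <code>sparsify</code>, and the toy matrix are assumptions, not details given in this summary.<br />

```python
import numpy as np

def sparsify(weights, threshold):
    """Zero every weight whose magnitude is below `threshold` (assumed rule).
    Returns the pruned matrix and the fraction of weights that were kept."""
    w = np.asarray(weights, dtype=float)
    keep = np.abs(w) >= threshold
    return np.where(keep, w, 0.0), keep.mean()

# Toy weight matrix (hypothetical values).
W = np.array([[0.50, -0.01],
              [0.02, -0.80]])
W_sparse, kept_fraction = sparsify(W, threshold=0.1)
```

A pruned matrix like <code>W_sparse</code> can be stored and multiplied in a sparse format, which reduces the cost of evaluating a large network.<br />
<br />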
== YouTube Speech Recognition Task ==<br />
The goal of this task is to transcribe YouTube data. This type of data does not have a strong language model to constrain the interpretation of the speech information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision-tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature-space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only 4 hidden layers were used, to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. In addition, the DNN-HMM and GMM-HMM models were combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced it by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
A GMM-HMM baseline system was used to align 50 hrs of speech from the 1996 and 1997 English Broadcast News Speech Corpora. SAT and DT features were used to train the DNN. The network consists of six hidden layers of 1024 units each, and there are 2220 triphone HMM states. A window of 9 frames was used to classify the middle frame. Fine-tuning was done after training to optimize the mutual information. The DNN-HMM system achieved a word error rate of 17.5% compared to 18.8% for the best GMM-HMM system <ref name=broadcastDNN><br />
T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Improvements in using<br />
deep belief networks for large vocabulary continuous speech recognition,”<br />
Speech and Language Algorithm Group, IBM, Yorktown Heights, NY, Tech.<br />
Rep. UTML TR 2010-003, Feb. 2011.<br />
</ref>.<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
The following table compares the performance of DNN-HMMs and GMM-HMMs. DNN-HMMs achieve lower error rates than GMM-HMMs trained on the same data in every task for which both are reported.<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870h)<br />
|-<br />
| YouTube<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
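<br />
Since the text mixes absolute and relative error reductions, the short sketch below recomputes the relative reductions from the numbers in the table above. The helper name <code>relative_reduction</code> and the dictionary layout are hypothetical; the error rates are copied from the table, with the GMM-HMM column taken from the same-data baseline.<br />

```python
def relative_reduction(baseline, new):
    """Relative error-rate reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline - new) / baseline

# (GMM-HMM error rate, DNN-HMM error rate) in %, taken from the table above.
results = {
    "Switchboard (test set 1)": (27.4, 18.5),
    "Switchboard (test set 2)": (23.6, 16.1),
    "English Broadcast News": (18.8, 17.5),
    "Bing Voice Search (sentence error)": (36.2, 30.4),
    "YouTube": (52.3, 47.6),
}

for task, (gmm, dnn) in results.items():
    print(f"{task}: {gmm - dnn:.1f} absolute, {relative_reduction(gmm, dnn):.1f}% relative")
```

For example, the YouTube result is a 4.7% absolute reduction, which is a relative reduction of roughly 9% because the baseline error rate is high.<br />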
<br />
= Alternative Pretraining Methods for DNNs =<br />
Pretraining DNNs was reported to improve the results on TIMIT and on the large data set tasks. The pretraining was done generatively using a stack of RBMs. An alternative is discriminative pretraining. In discriminative pretraining, we start from a shallow network whose weights are trained discriminatively. Then another hidden layer is added between the last hidden layer and the softmax layer, the weights of the newly added layer are again learned discriminatively, and so on. Finally, backpropagation fine-tuning is applied to the whole network. This way of training was reported to achieve the same results as generative pretraining <ref name=discrDNN><br />
F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent<br />
deep neural networks for conversational speech transcription,” in Proc.<br />
IEEE ASRU, 2011, pp. 24–29.<br />
</ref>.<br />
<br />
= Conclusions and Discussions =<br />
GMMs have been widely used for acoustic modelling because they are easy to train and quite flexible. Since 2009, DNNs have been proposed as a replacement for GMMs, and they have proven superior to GMMs in many speech recognition tasks. Pretraining DNNs is essential when the amount of data is small; it reduces both overfitting and convergence time. Fine-tuning the network to optimize the mutual information can improve the results. The authors think that there is still much that can be done in pretraining, fine-tuning, and using different types of hidden units to further increase the performance of DNNs.<br />
<br />
This paper summarizes recent research by four research groups in the area of speech recognition. They have shown that DNNs are superior to GMMs in both small and large dataset speech recognition tasks. The authors claim that the reason for this superiority is that speech data lie on a manifold; I am not sure whether there is any scientific or empirical proof for this claim.<br />
This paper is an excellent source for anyone interested in deep learning for speech recognition. The authors assume that readers are familiar with the HMM framework for speech recognition; otherwise the paper will be difficult to follow. Also, the DNN area is moving fast, and the information in the paper is not up to date.<br />
<br />
= References =<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=graves_et_al.,_Speech_recognition_with_deep_recurrent_neural_networks&diff=26101graves et al., Speech recognition with deep recurrent neural networks2015-11-10T17:52:36Z<p>Arashwan: </p>
<hr />
<div>= Overview =<br />
<br />
This document is a summary of the paper ''Speech recognition with deep recurrent neural networks'' by A. Graves, A.-R. Mohamed, and G. Hinton, which appeared in the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The first and third authors are Artificial Neural Network (ANN) researchers, while Mohamed works in the field of automatic speech recognition.<br />
<br />
The paper presents the application of bidirectional multilayer Long Short-term Memory (LSTM) ANNs with 1–5 layers to phoneme recognition on the TIMIT acoustic-phonetic corpus, which is the standard benchmark in the field of acoustic recognition, extending the previous work by Mohamed and Hinton on this topic using Deep Belief Networks. The TIMIT corpus contains audio recordings of 6300 sentences spoken by 630 (American) English speakers from 8 regions with distinct dialects, where each recording is accompanied by manually labelled transcriptions of the phonemes in the audio clip together with timestamp information. The empirical classification accuracies reported in the literature before the publication of this paper are shown in the timeline below (note that in this figure, the accuracy metric is 100% - PER, where PER is the phoneme classification error rate).<br />
<br />
The deep LSTM networks with 3 or more layers obtain phoneme classification error rates of 19.6% or less, with one model obtaining 17.7%, which was the best result reported in the literature at the time, outperforming the previous record of 20.7% achieved by Mohamed et al. Furthermore, the error rate decreases monotonically with LSTM network depth for 1–5 layers. While the bidirectional LSTM model performs well on the TIMIT corpus, any potential advantage of bidirectional over unidirectional LSTM network models cannot be determined from this paper, since the performance comparison is across different numbers of iterations of the optimization algorithm used to train the models, and multiple trials for statistical validity were not performed.<br />
<br />
<br />
[[File:timit.png | frame | center |Timeline of percentage phoneme recognition accuracy achieved on the core TIMIT corpus, from Lopes and Perdigao, 2011. ]]<br />
<br />
== Motivation ==<br />
Neural networks have been trained for speech recognition problems, but usually in combination with hidden Markov models. The authors argue that, since speech is an inherently dynamic process, RNNs should be the ideal choice for such a problem. There have been attempts to train RNNs for speech recognition <ref>A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Un-segmented Sequence Data with Recurrent Neural Networks,” in ICML, Pittsburgh, USA, 2006.</ref> <ref> A. Graves, Supervised sequence labelling with recurrent neural networks, vol. 385, Springer, 2012.</ref> <ref> A. Graves, “Sequence transduction with recurrent neural networks,” in ICML Representation Learning Workshop, 2012.</ref> and RNNs with LSTM for recognizing cursive handwriting <ref> A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, “Unconstrained Online Handwriting Recognition with Recurrent Neural Networks,” in NIPS, 2008.</ref>, but neither has made an impact on speech recognition. The authors drew inspiration from Convolutional Neural Networks, where multiple layers are stacked on top of each other, to build deep RNNs with LSTM units.<br />
<br />
However, instead of a conventional RNN, which only considers previous context, a bidirectional RNN (BRNN) <ref> M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,” IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.</ref> was used to consider both forward and backward contexts. This is partly because the authors saw no reason not to exploit future context, since whole speech utterances are transcribed at once. Additionally, a BRNN has the added benefit of being able to consider the entire forward and backward context, not just some predefined window.<br />
<br />
[[File:brnn.png|center|600px]]<br />
<br />
= Deep RNN models considered by Graves et al. =<br />
<br />
In this paper, Graves et al. use deep LSTM network models. We briefly review recurrent neural networks, which form the basis of the more complicated LSTM network that has composite <math>\mathcal{H}</math> functions instead of sigmoids and additional parameter vectors associated with the ''state'' of each neuron. Finally, a description of ''bidirectional'' ANNs is given, which is used throughout the numerical experiments.<br />
<br />
== Recurrent Neural Networks ==<br />
<br />
Recall that a standard 1-layer recurrent neural network (RNN) computes the hidden vector sequence <math>{\boldsymbol h} = ({{\mathbf{h}}}_1,\ldots,{{\mathbf{h}}}_T)</math> and output vector sequence <math>{{\boldsymbol {{\mathbf{y}}}}}= ({{\mathbf{y}}}_1,\ldots,{{\mathbf{y}}}_T)</math> from an input vector sequence <math>{{\boldsymbol {{\mathbf{x}}}}}= ({{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_T)</math> through the following equation where the index is from <math>t=1</math> to <math>T</math>:<br />
<br />
<math>{{\mathbf{h}}}_t = \begin{cases}<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{\mathbf{b_{h}}}}}\right) &\quad t = 1\\<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{{\mathbf{W}}}_{h h}}}{{\mathbf{h}}}_{t-1} + {{{\mathbf{b_{h}}}}}\right) &\quad \text{else}<br />
\end{cases}</math><br />
<br />
and<br />
<br />
<math>{{\mathbf{y}}}_t = {{{{\mathbf{W}}}_{h y}}}{{\mathbf{h}}}_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The <math>{{\mathbf{W}}}</math> terms are the parameter matrices with subscripts denoting the layer location (e.g. <math>{{{{\mathbf{W}}}_{x h}}}</math> is the input-hidden weight matrix), and the offset <math>b</math> terms are bias vectors with appropriate subscripts (e.g. <math>{{{\mathbf{b_{h}}}}}</math> is the hidden bias vector). The function <math>{\mathcal{H}}</math> is an elementwise vector function with a range of <math>[0,1]</math> for each component in the hidden layer.<br />
<br />
This paper considers multilayer RNN architectures, with the same hidden layer function used for all <math>N</math> layers. In this model, the hidden vector in the <math>n</math>th layer, <math>{\boldsymbol h}^n</math>, is generated by the rule<br />
<br />
<math>{{\mathbf{h}}}^n_t = {\mathcal{H}}\left({{\mathbf{W}}}_{h^{n-1}h^{n}} {{\mathbf{h}}}^{n-1}_t +<br />
{{\mathbf{W}}}_{h^{n}h^{n}} {{\mathbf{h}}}^n_{t-1} + {{{\mathbf{b_{h}}}}}^n \right),</math><br />
<br />
where <math>{\boldsymbol h}^0 = {{\boldsymbol {{\mathbf{x}}}}}</math>. The final network output vector in the <math>t</math>th step of the output sequence, <math>{{\mathbf{y}}}_t</math>, is<br />
<br />
<math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{h^N y}} {{\mathbf{h}}}^N_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
This is pictured in the figure below for an arbitrary layer and time step.<br />
[[File:rnn_graves.png | frame | center |Fig 1. Schematic of a Recurrent Neural Network at an arbitrary layer and time step. ]]<br />
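<br />
The multilayer recursion above can be sketched directly in NumPy. This is a minimal illustration rather than the authors' code: the function name <code>deep_rnn_forward</code>, the layer sizes and the random toy parameters are assumptions, <math>\mathcal{H}</math> is taken to be the logistic sigmoid, and a zero initial hidden state is used, which reproduces the <math>t=1</math> case of the single-layer equation.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_rnn_forward(x, layers, W_hy, b_y):
    """x: (T, d_in) input sequence.
    layers: list of (W_in, W_rec, b) per hidden layer, bottom to top.
    Returns the output sequence y with shape (T, d_out)."""
    h_below = x                                   # h^0 = x
    for W_in, W_rec, b in layers:
        T = h_below.shape[0]
        h = np.zeros((T, b.shape[0]))
        prev = np.zeros(b.shape[0])               # zero state before the first step
        for t in range(T):
            # h^n_t = H(W_{h^{n-1}h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b^n_h)
            prev = sigmoid(W_in @ h_below[t] + W_rec @ prev + b)
            h[t] = prev
        h_below = h                               # feed this layer's sequence upward
    return h_below @ W_hy.T + b_y                 # y_t = W_{h^N y} h^N_t + b_y

# Toy configuration (illustrative sizes): 2 hidden layers of 4 units.
rng = np.random.default_rng(0)
T, d_in, n_h, d_out = 6, 3, 4, 2
layers = [(rng.normal(size=(n_h, d_in)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)),
          (rng.normal(size=(n_h, n_h)), rng.normal(size=(n_h, n_h)), np.zeros(n_h))]
W_hy, b_y = rng.normal(size=(d_out, n_h)), np.zeros(d_out)
x = rng.normal(size=(T, d_in))
y = deep_rnn_forward(x, layers, W_hy, b_y)
```

Each layer processes the full sequence before the next layer starts, which is equivalent to evaluating all layers at each time step because layer <math>n</math> at step <math>t</math> depends only on layer <math>n-1</math> at step <math>t</math> and on its own state at step <math>t-1</math>.<br />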
<br />
== Long Short-term Memory Architecture ==<br />
<br />
Graves et al. consider a Long Short-Term Memory (LSTM) architecture from Gers et al. This model replaces <math>\mathcal{H}(\cdot)</math> by a composite function that introduces additional parameter matrices, and hence a higher dimensional model. Each neuron in the network (i.e. row of a parameter matrix <math>{{\mathbf{W}}}</math>) has an associated state vector <math>{{\mathbf{c}}}_t</math> at step <math>t</math>, which is a function of the previous <math>{{\mathbf{c}}}_{t-1}</math>, the input <math>{{\mathbf{x}}}_t</math> at step <math>t</math>, and the previous step’s hidden state <math>{{\mathbf{h}}}_{t-1}</math> as<br />
<br />
<math>{{\mathbf{c}}}_t = {{\mathbf{f}}}_t \circ {{\mathbf{c}}}_{t-1} + {{\mathbf{i}}}_t \circ \tanh<br />
\left({{{\mathbf{W}}}_{x {{\mathbf{c}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{c}}}}} {{\mathbf{h}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{c}}}\right)</math><br />
<br />
where <math>\circ</math> denotes the Hadamard product (elementwise vector multiplication), and the vector <math>{{\mathbf{i}}}_t</math> is the ''input gate'' vector of the cell, which is generated by the rule<br />
<br />
<math>{{\mathbf{i}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{i}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{i}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{i}}}\right),</math><br />
<br />
and <math>{{\mathbf{f}}}_t</math> is the ''forget gate'' vector, which is given by<br />
<br />
<math>{{\mathbf{f}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{f}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{f}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{f}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{f}}}\right)</math><br />
<br />
Each <math>{{\mathbf{W}}}</math> matrix and bias vector <math>{{\mathbf{b}}}</math> is a free parameter in the model and must be trained. Since <math>{{\mathbf{f}}}_t</math> multiplies the previous state <math>{{\mathbf{c}}}_{t-1}</math> in a Hadamard product with each element in the range <math>[0,1]</math>, it can be understood to reduce or dampen the effect of <math>{{\mathbf{c}}}_{t-1}</math> relative to the new input <math>{{\mathbf{i}}}_t</math>. The final hidden output state is then<br />
<br />
<math>{{\mathbf{h}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{o}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{o}}}}} {{\mathbf{h}}}_{t-1}<br />
+ {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{o}}}}} {{\mathbf{c}}}_{t} + {{\mathbf{b}}}_{{\mathbf{o}}}\right)\circ \tanh({{\mathbf{c}}}_t)</math><br />
<br />
In all of these equations, <math>\sigma</math> denotes the logistic sigmoid function. Note furthermore that <math>{{\mathbf{i}}}</math>, <math>{{\mathbf{f}}}</math>, <math>{{\mathbf{o}}}</math> and <math>{{\mathbf{c}}}</math> are all of the same dimension as the hidden vector <math>h</math>. In addition, the weight matrices from the cell to the gate vectors (e.g. <math>{{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}</math>) are ''diagonal'', such that each of these parameter matrices is merely a scaling matrix.<br />
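<br />
The LSTM equations above translate into a single time-step update as sketched below. This is an illustration under stated assumptions, not the authors' implementation: the function name <code>lstm_step</code>, the parameter dictionary keys and the toy sizes are hypothetical, and the diagonal cell-to-gate matrices are stored as vectors (<code>w_ci</code>, <code>w_cf</code>, <code>w_co</code>) so that the matrix product becomes an elementwise product.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the equations above; p holds the parameters."""
    # Input and forget gates look at the previous cell state c_{t-1}.
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    # New cell state: damped old state plus gated new candidate.
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    # Output gate looks at the updated cell state c_t.
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])
    h = o * np.tanh(c)
    return h, c

# Toy parameters (illustrative sizes and scales).
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
params = {}
for gate in "ifco":
    params[f"W_x{gate}"] = 0.1 * rng.normal(size=(n_h, n_in))
    params[f"W_h{gate}"] = 0.1 * rng.normal(size=(n_h, n_h))
    params[f"b_{gate}"] = np.zeros(n_h)
for gate in "ifo":
    params[f"w_c{gate}"] = 0.1 * rng.normal(size=n_h)   # diagonal of W_c{gate}

# Run the cell over a short random sequence, starting from zero states.
h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c, params)
```

Because the output gate lies in <math>(0,1)</math> and <math>\tanh</math> lies in <math>(-1,1)</math>, every component of the hidden vector stays strictly inside <math>(-1,1)</math>, while the cell state itself is not bounded in this way.<br />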
<br />
== Bidirectional RNNs ==<br />
<br />
A bidirectional RNN adds another layer of complexity by computing 2 hidden vectors per layer. Neglecting the <math>n</math> superscripts for the layer index, the ''forward'' hidden vector is determined through the conventional recursion as<br />
<br />
<math>{\overrightarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overrightarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}{\overrightarrow{{{\mathbf{h}}}}}}} {\overrightarrow{{{\mathbf{h}}}}}_{t-1} + {{\mathbf{b_{{\overrightarrow{{{\mathbf{h}}}}}}}}} \right),</math><br />
<br />
while the ''backward'' hidden state is determined recursively from the ''reversed'' sequence <math>({{\mathbf{x}}}_T,\ldots,{{\mathbf{x}}}_1)</math> as<br />
<br />
<math>{\overleftarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overleftarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}{\overleftarrow{{{\mathbf{h}}}}}}} {\overleftarrow{{{\mathbf{h}}}}}_{t+1} + {{\mathbf{b_{{\overleftarrow{{{\mathbf{h}}}}}}}}}\right).</math><br />
<br />
The final output for the single layer state is then an affine transformation of <math>{\overrightarrow{{{\mathbf{h}}}}}_t</math> and <math>{\overleftarrow{{{\mathbf{h}}}}}_t</math> as <math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}y}} {\overrightarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}y}} {\overleftarrow{{{\mathbf{h}}}}}_t<br />
+ {{{\mathbf{b_{y}}}}}.</math><br />
<br />
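The forward and backward recursions above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the logistic sigmoid is assumed for <math>{\mathcal{H}}</math>, the initial hidden states are assumed to be zero, and the names <code>run_direction</code> and <code>birnn_forward</code> as well as the toy sizes are hypothetical.<br />
<br />
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_direction(xs, W_x, W_h, b):
    """Plain recursion h_t = H(W_x x_t + W_h h_prev + b) over the rows of xs."""
    h = np.zeros(W_h.shape[0])  # assumed zero initial state
    out = []
    for x_t in xs:
        h = sigmoid(W_x @ x_t + W_h @ h + b)
        out.append(h)
    return np.stack(out)

def birnn_forward(xs, fwd, bwd, W_fy, W_by, b_y):
    h_fwd = run_direction(xs, *fwd)               # processes x_1, ..., x_T
    h_bwd = run_direction(xs[::-1], *bwd)[::-1]   # processes x_T, ..., x_1, re-aligned to t
    return h_fwd @ W_fy.T + h_bwd @ W_by.T + b_y  # affine combination y_t

# Toy usage with hypothetical sizes
rng = np.random.default_rng(0)
T, n_in, n_hid, n_out = 6, 3, 4, 2
xs = rng.normal(size=(T, n_in))
make = lambda: (rng.uniform(-0.1, 0.1, (n_hid, n_in)),
                rng.uniform(-0.1, 0.1, (n_hid, n_hid)),
                np.zeros(n_hid))
fwd, bwd = make(), make()
W_fy = rng.uniform(-0.1, 0.1, (n_out, n_hid))
W_by = rng.uniform(-0.1, 0.1, (n_out, n_hid))
ys = birnn_forward(xs, fwd, bwd, W_fy, W_by, np.zeros(n_out))
```
<br />
Reversing the input, running the same recursion, and reversing the result back is one simple way to obtain the backward states aligned with the forward ones.<br />
<br />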
The combination of LSTM with bidirectional dependence has been previously used by Graves and Schmidhuber, and is extended to multilayer networks in this paper. The motivation for this is to use dependencies on both prior and posterior vectors in the sequence to predict a given output at any time step. In other words, a forward and backward context is used.<br />
<br />
= Network Training for Phoneme Recognition =<br />
<br />
This section describes the phoneme classification experiments performed with the TIMIT corpus. An overview of the input timeseries audio data preprocessing into frequency domain vectors is given, and the optimization techniques are described.<br />
<br />
== Frequency Domain Processing ==<br />
<br />
Recall that for a real, periodic signal <math>{f(t)}</math>, the Fourier transform<br />
<br />
<math>{F(\omega)}= \int_{-\infty}^{\infty}e^{-i\omega t} {f(t)}dt</math><br />
<br />
can be represented for discrete samples <math>{f_0, f_1, \cdots<br />
f_{N-1}}</math> as<br />
<br />
<math>F_k\ = \sum_{n=0}^{N-1} f_n \cdot e^{-i \frac{2 \pi k n}{N}}</math>,<br />
<br />
where <math>{F_k}</math> are the discrete coefficients of the (amplitude) spectral distribution of the signal <math>f</math> in the frequency domain. This is a particularly powerful representation of audio data, since the modulation produced by the tongue and lips when shifting position while the larynx vibrates induces the frequency changes that make up a phonetic alphabet. A particular example is the spectrogram below. A spectrogram is a heat map representation of a matrix of data in which each pixel has intensity proportional to the magnitude of the matrix entry at that location. In this case, the matrix is a windowed Fourier transform of an audio signal spectrum such that the frequency of the audio signal as a function of time can be observed. This spectrogram shows the frequencies of my voice while singing the first bar of ''Hey Jude''; the bright pixels below 200 Hz show the base note, while the fainter lines at integer multiples of the base notes show resonant harmonics.<br />
<br />
[[File:spect.png | frame | center | Spectrogram of the first bar of Hey Jude, showing the frequency amplitude coefficients changing over time as the intensity of the pixels in the heat map.]]<br />
<br />
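As a small illustration of the discrete transform, the sketch below evaluates the standard DFT sum <math>F_k = \sum_n f_n e^{-2\pi i k n / N}</math> directly and locates the dominant frequency of a test tone. The frame length of 80 samples and the 1 kHz tone are illustrative choices, and the function name <code>dft</code> is hypothetical; in practice a fast Fourier transform such as <code>numpy.fft.fft</code> would be used instead of the quadratic-time sum.<br />
<br />
```python
import numpy as np

def dft(f):
    """Naive O(N^2) DFT: F_k = sum_n f_n * exp(-2*pi*i*k*n/N)."""
    N = len(f)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N) @ f

fs = 16000                          # sampling frequency in Hz
t = np.arange(80) / fs              # illustrative 80-sample frame
f = np.sin(2 * np.pi * 1000 * t)    # 1 kHz test tone (exactly 5 cycles in the frame)
F = dft(f)

# Only the first half of the spectrum is unique for a real signal
peak_bin = int(np.argmax(np.abs(F[: len(f) // 2])))
peak_hz = peak_bin * fs / len(f)    # bin index converted to Hz
```
<br />
Computing this transform on successive frames and stacking the magnitudes column by column gives the matrix that a spectrogram displays.<br />
<br />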
== Input Vector Format ==<br />
<br />
For each audio waveform in the TIMIT dataset, the Fourier coefficients were computed with a sliding Discrete Fourier Transform (DFT). The window duration used was 10 ms, corresponding to <math>n_s =<br />
80</math> samples per DFT since each waveform in the corpus was digitally registered with a sampling frequency of <math>f_s = 16</math> kHz, producing 40 unique coefficients at each timestep <math>t</math>, <math>\{c^{[t]}_k\}_{k=1}^{40}</math>. In addition, the first and second time derivatives of the coefficients between adjacent DFT windows were computed (the methodology is unspecified, however most likely this was performed with a numerical central difference technique). Thus, the input vector to the network at step <math>t</math> was the concatenated vector<br />
<br />
<math>{{\mathbf{x}}}_t = [c^{[t]}_1, \frac{d}{dt}c^{[t]}_1,<br />
\frac{d^2}{dt^2}c^{[t]}_1, c_2^{[t]}, \frac{d}{dt}c^{[t]}_2,<br />
\frac{d^2}{dt^2}c^{[t]}_2 \ldots]^T.</math><br />
<br />
Finally, an additional preprocessing step was performed: each input vector was normalized such that the dataset had zero mean and unit variance.<br />
<br />
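The construction of the input vectors can be sketched as follows. This is a sketch under stated assumptions rather than the authors' procedure: the derivatives are approximated with <code>numpy.gradient</code> (central differences in the interior, one-sided at the ends), the normalization is applied per input component over all frames, and the function name <code>build_inputs</code> and the random stand-in data are hypothetical.<br />
<br />
```python
import numpy as np

def build_inputs(coeffs):
    """coeffs: (T, K) array of spectral coefficients, one row per DFT window."""
    d1 = np.gradient(coeffs, axis=0)   # first time derivative
    d2 = np.gradient(d1, axis=0)       # second time derivative
    # Interleave as [c_1, c_1', c_1'', c_2, c_2', c_2'', ...] for each time step
    x = np.stack([coeffs, d1, d2], axis=2).reshape(coeffs.shape[0], -1)
    # Normalize each component to zero mean and unit variance over the data
    return (x - x.mean(axis=0)) / x.std(axis=0)

rng = np.random.default_rng(0)
coeffs = rng.normal(size=(50, 40))     # stand-in for 50 windows of 40 coefficients
x = build_inputs(coeffs)               # shape (50, 120)
```
<br />
In a real experiment the mean and standard deviation would normally be estimated on the training set only and then reused for the other subsets.<br />
<br />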
== RNN Transducer ==<br />
<br />
When building a speech recognition classifier it is important to note that the input and output sequences have different lengths (sound data to phonemes). Additionally, RNNs require segmented input data. One approach to solving both of these problems is to align the output (label) to the input (sound data), but more often than not an aligned dataset is not available. In this paper, the Connectionist Temporal Classification (CTC) method is used to define a probability distribution over output sequences given an input sequence. This is augmented with an RNN that predicts each phoneme given the previous phonemes. The two predictions are then combined by a feed-forward network. The authors call this approach an RNN Transducer. From the distribution of the RNN and CTC, a maximum likelihood decoding for a given input can be computed to find the corresponding output label.<br />
<br />
<math>h(x) = \arg \max_{l \in L^{\leq T}} P(l | x)</math><br />
<br />
Where:<br />
<br />
* <math>h(x)</math>: classifier<br />
* <math>x</math>: input sequence<br />
* <math>l</math>: label<br />
* <math>L</math>: alphabet<br />
* <math>T</math>: maximum sequence length<br />
* <math>P(l | x)</math>: probability distribution of <math>l</math> given <math>x</math><br />
<br />
The value for <math>h(x)</math> cannot be computed directly; it is approximated with methods such as Best Path and Prefix Search Decoding. The authors chose to use a graph search algorithm called Beam Search.<br />
<br />
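To make the idea of an approximate decoding concrete, the sketch below implements Best Path decoding, the simplest of the methods named above; it is not the beam search used by the authors. It assumes the usual CTC convention that repeated symbols are merged and the null (blank) symbol is then dropped; the function name <code>best_path_decode</code>, the choice of index 0 for the blank, and the toy distribution are hypothetical.<br />
<br />
```python
import numpy as np

def best_path_decode(probs, blank=0):
    """probs: (T, K) per-frame distributions; returns the collapsed label sequence."""
    path = probs.argmax(axis=1).tolist()   # most probable symbol in each frame
    labels = []
    prev = None
    for s in path:
        # keep a symbol only when it differs from the previous frame and is not blank
        if s != prev and s != blank:
            labels.append(s)
        prev = s
    return labels

# Toy example: 8 frames, 3 symbols (0 is the blank)
frame_symbols = [0, 1, 1, 0, 1, 2, 2, 0]
probs = np.eye(3)[frame_symbols] * 0.7 + 0.1   # each row sums to 1
decoded = best_path_decode(probs)
```
<br />
Here the per-frame path 0, 1, 1, 0, 1, 2, 2, 0 collapses to the label sequence 1, 1, 2: the blank between the two runs of symbol 1 is what allows a repeated label to be emitted.<br />
<br />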
== Network Output Layer ==<br />
<br />
Two different network output layers were used, however most experimental results were reported for a simple softmax probability distribution vector over the set of <math>K =<br />
62</math> symbols, corresponding to the 61 phonemes in the corpus and an additional ''null'' symbol indicating that no phoneme distinct from the previous one was detected. This model is referred to as a Connectionist Temporal Classification (CTC) output function. The other (more complicated) output layer was not rigorously compared with a softmax output, and had nearly identical performance; this summary defers a description of this method, a so-called ''RNN transducer'' to the original paper.<br />
<br />
== Network Training Procedure ==<br />
<br />
The parameters in all ANNs were determined using Stochastic Gradient Descent with a fixed update step size (learning rate) of <math>10^{-4}</math> and a Nesterov momentum term of 0.9. The initial parameters were uniformly randomly drawn from <math>[-0.1,0.1]</math>. The optimization procedure was initially run with data instances from the standard 462 speaker training set of the TIMIT corpus. As a stopping criterion for the training, a secondary testing subset of 50 speakers was used on which the phoneme error rate (PER) was computed in each iteration of the optimization algorithm. The initial training phase for each network was halted once the PER stopped decreasing on this testing subset; using the parameters at this point as the initial weights, the optimization procedure was then re-run with Gaussian noise with zero mean and <math>\sigma = 0.075</math> added element-wise to the parameters for each input sequence <math>({{\mathbf{x}}}_1,\ldots, {{\mathbf{x}}}_T)</math> as a form of regularization. The second optimization procedure was again halted once the PER stopped decreasing on the testing dataset. Multiple trials in each of these numerical experiments were not performed, and as such, the variability in performance due to the initial values of the parameters in the optimization routine is unknown.<br />
<br />
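The update rule described above can be sketched on a toy problem. The learning rate, momentum, initialization range and noise level are the values quoted in this section; everything else is an assumption made for illustration, including the quadratic toy loss, the hypothetical <code>target</code> vector, the number of steps, and the particular Nesterov formulation (gradient evaluated at the look-ahead point).<br />
<br />
```python
import numpy as np

rng = np.random.default_rng(0)
lr, mu, sigma = 1e-4, 0.9, 0.075       # learning rate, momentum, weight-noise std
target = np.ones(10)                   # hypothetical optimum of a toy loss

def loss(w):
    return 0.5 * np.sum((w - target) ** 2)

def grad(w):
    return w - target

w = rng.uniform(-0.1, 0.1, size=10)    # initial parameters drawn from [-0.1, 0.1]
v = np.zeros_like(w)
initial_loss = loss(w)
for step in range(5000):
    # Look-ahead point plus Gaussian weight noise; the noise only perturbs the
    # point at which the gradient is evaluated, not the stored parameters.
    noisy = w + mu * v + rng.normal(0.0, sigma, size=w.shape)
    v = mu * v - lr * grad(noisy)
    w = w + v
final_loss = loss(w)
```
<br />
In the procedure summarized above the noise is only switched on in the second training phase; the sketch shows a single noisy phase for brevity.<br />
<br />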
= TIMIT Corpus Experiments &amp; Results =<br />
<br />
== Numerical Experiments ==<br />
<br />
To investigate the performance of the Bidirectional LSTM architecture as a function of depth, numerical experiments were conducted with networks with <math>N \in \{1,2,3,5\}</math> layers and 250 hidden units per layer. These are denoted in the paper by the network names CTC-<math>N</math>L-250H (where <math>N</math> is the layer depth), and are summarized with the number of free model parameters in the table below.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|}<br />
<br />
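The parameter counts in the table can be approximately reproduced with a short calculation. The sketch below rests on several assumptions that are not stated in this summary: each LSTM layer has four weight blocks (input gate, forget gate, cell, output gate) with biases plus three diagonal peephole vectors, every layer above the first receives both the forward and backward outputs of the layer below, the input has 120 components (40 coefficients with first and second derivatives), and the output layer maps both directions to 62 symbols. The function names are hypothetical.<br />
<br />
```python
def lstm_layer_params(n_in, n_hid):
    # 4 blocks (input gate, forget gate, cell, output gate), each with input
    # weights, recurrent weights and a bias, plus 3 diagonal peephole vectors.
    return 4 * (n_in * n_hid + n_hid * n_hid + n_hid) + 3 * n_hid

def bilstm_ctc_params(n_layers, n_hid=250, n_in=120, n_out=62):
    total = 0
    layer_in = n_in
    for _ in range(n_layers):
        total += 2 * lstm_layer_params(layer_in, n_hid)  # forward + backward
        layer_in = 2 * n_hid                             # next layer sees both directions
    total += 2 * n_hid * n_out + n_out                   # output weights and bias
    return total

# Parameter counts in millions, rounded to one decimal place
counts = {n: round(bilstm_ctc_params(n) / 1e6, 1) for n in (1, 2, 3, 5)}
```
<br />
Under these assumptions the rounded counts for 1, 2, 3 and 5 layers come out as 0.8M, 2.3M, 3.8M and 6.8M, in agreement with the table.<br />
<br />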
Additional experiments included: a 1-layer model with 3.8M weights, a 3-layer bidirectional ANN with <math>\tanh</math> activation functions rather than LSTM, and a 3-layer unidirectional LSTM model with 3.8M weights (the same number of free parameters as the bidirectional 3-layer LSTM model). Finally, two experiments were performed with a bidirectional LSTM model with 3 hidden layers, each with 250 hidden units, and an RNN transducer output function. One of these experiments used uniformly randomly initialized parameters, and the other used the final (hidden) parameter weights from the CTC-3L-250H model as the initial parameter values in the optimization algorithm. The names of these experiments are summarized below, where TRANS and PRETRANS denote the RNN transducer experiments initialized randomly, and using (pretrained) parameters from the CTC-3L-250H model, respectively. The suffixes UNI and TANH denote the unidirectional and <math>\tanh</math> networks, respectively.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|-<br />
|PreTrans-3l-250h<br />
|4.3M<br />
|}<br />
<br />
== Results ==<br />
<br />
The percentage phoneme error rates and number of epochs in the SGD optimization procedure for the LSTM experiments on the TIMIT dataset with varying network depth are shown below. The PER can be seen to decrease monotonically, however there is negligible difference between 3 and 5 layers—it is possible that the 0.2% difference is within statistical fluctuations induced by the SGD optimization routine and initial parameter values. Note that the allocation of the epochs into either the initial training without noise or the second optimization routine with Gaussian noise added (or both) is unspecified in the paper.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|82<br />
|23.9%<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|55<br />
|21.0%<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|124<br />
|18.6%<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|150<br />
|18.4%<br />
|}<br />
<br />
The second set of PER results is shown below. The unidirectional LSTM architecture CTC-3L-421H-UNI achieves an error rate that is greater than the CTC-3L-250H model by 1 percentage point. No further comparative experiments between unidirectional and bidirectional models were given, however, and the margin of statistical uncertainty is unknown; thus the 1% (absolute) difference may or may not be significant. The TRANS-3L-250H model achieves a nearly identical PER to the CTC softmax model (a 0.3% difference), however note that it has 0.5M ''more'' parameters due to the additional classification network at the output, and is hence not an entirely fair comparison since it has a greater dimensionality. The pretrained model PRETRANS-3L-250H also has 4.3M parameters and sees the best performance with a 17.7% error rate. Note that the difference in training of these two RNN transducer models is primarily in their initialization: the PRETRANS model was initialized using the trained weights of the CTC-3L-250H model (for the hidden layers). Thus, this difference in error rate of 0.6% is the direct result of different starting iterates in the optimization procedure, which must be kept in mind when comparing between models.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|87<br />
|23.0%<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|107<br />
|37.6%<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|115<br />
|19.6%<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|112<br />
|18.3%<br />
|-<br />
|'''PreTrans-3l-250h'''<br />
|'''4.3M'''<br />
|'''144'''<br />
|'''17.7%'''<br />
|}<br />
<br />
= References =<br />
<br />
A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in <span>''Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on''</span>, pp. 6645–6649, IEEE, 2013.<br />
<br />
C. Lopes and F. Perdig<span>ã</span>o, “Phone recognition on the timit database,” <span>''Speech Technologies/Book''</span>, vol. 1, pp. 285–302, 2011.<br />
<br />
A.-r. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” <span>''Audio, Speech, and Language Processing, IEEE Transactions on''</span>, vol. 20, no. 1, pp. 14–22, 2012.<br />
<br />
F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning precise timing with lstm recurrent networks,” <span>''The Journal of Machine Learning Research''</span>, vol. 3, pp. 115–143, 2003.<br />
<br />
A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm networks,” in <span>''Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on''</span>, vol. 4, pp. 2047–2052, IEEE, 2005.</div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=graves_et_al.,_Speech_recognition_with_deep_recurrent_neural_networks&diff=26100graves et al., Speech recognition with deep recurrent neural networks2015-11-10T17:51:18Z<p>Arashwan: /* Overview */</p>
<hr />
<div>= Overview =<br />
<br />
This document is a summary of the paper ''Speech recognition with deep recurrent neural networks'' by A. Graves, A.-R. Mohamed, and G. Hinton, which appeared in the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The first and third authors are Artificial Neural Network (ANN) researchers, while Mohamed works in the field of automatic speech recognition.<br />
<br />
The paper presents the application of bidirectional multilayer Long Short-term Memory (LSTM) ANNs with 1–5 layers to phoneme recognition on the TIMIT acoustic phoneme corpus, which is the standard benchmark in the field of acoustic recognition, extending the previous work by Mohamed and Hinton on this topic using Deep Belief Networks. The TIMIT corpus contains audio recordings of 6300 sentences spoken by 630 (American) English speakers from 8 regions with distinct dialects, where each recording has accompanying manually labelled transcriptions of the phonemes in the audio clips alongside timestamp information. The empirical classification accuracies reported in the literature before the publication of this paper are shown in the timeline below (note that in this figure, the accuracy metric is 100% - PER, where PER is the phoneme classification error rate).<br />
<br />
The deep LSTM networks presented with 3 or more layers obtain phoneme classification error rates of 19.6% or less, with one model obtaining 17.7%, which was the best result reported in the literature at the time, outperforming the previous record of 20.7% achieved by Mohamed et al. Furthermore, the error rate decreases monotonically with LSTM network depth for 1–5 layers. While the bidirectional LSTM model performs well on the TIMIT corpus, any potential advantage of bidirectional over unidirectional LSTM network models cannot be determined from this paper, since the performance comparison is across different numbers of iterations taken in the optimization algorithm used to train the models, and multiple trials for statistical validity were not performed.<br />
<br />
<br />
[[File:timit.png | frame | center |Timeline of percentage phoneme recognition accuracy achieved on the core TIMIT corpus, from Lopes and Perdigao, 2011. ]]<br />
<br />
== Motivation ==<br />
Neural networks have been trained for speech recognition problems, but usually in combination with hidden Markov models. The authors of this paper argue that, since speech is an inherently dynamic process, RNNs should be the ideal choice for such a problem. There have been attempts to train RNNs for speech recognition <ref>A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in ICML, Pittsburgh, USA, 2006.</ref> <ref> A. Graves, Supervised sequence labelling with recurrent neural networks, vol. 385, Springer, 2012.</ref> <ref> A. Graves, “Sequence transduction with recurrent neural networks,” in ICML Representation Learning Workshop, 2012.</ref> and RNNs with LSTM for recognizing cursive handwriting <ref> A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, “Unconstrained Online Handwriting Recognition with Recurrent Neural Networks,” in NIPS, 2008.</ref> but neither has made an impact on speech recognition. The authors drew inspiration from Convolutional Neural Networks, where multiple layers are stacked on top of each other, to combine LSTM and RNNs together.<br />
<br />
However, instead of using a conventional RNN, which only considers previous contexts, a Bidirectional RNN <ref> M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,” IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.</ref> was used to consider both forward and backward contexts. This is in part because the authors saw no reason not to exploit future contexts, since the speech utterances are transcribed at once. Additionally, a BRNN has the added benefit of being able to consider the entire forward and backward context, not just some predefined window of forward and backward contexts.<br />
<br />
[[File:brnn.png|center|600px]]<br />
<br />
= Deep RNN models considered by Graves et al. =<br />
<br />
In this paper, Graves et al. use deep LSTM network models. We briefly review recurrent neural networks, which form the basis of the more complicated LSTM network that has composite <math>\mathcal{H}</math> functions instead of sigmoids and additional parameter vectors associated with the ''state'' of each neuron. Finally, a description of ''bidirectional'' ANNs is given, which are used throughout the numerical experiments.<br />
<br />
== Recurrent Neural Networks ==<br />
<br />
Recall that a standard 1-layer recurrent neural network (RNN) computes the hidden vector sequence <math>{\boldsymbol h} = ({{\mathbf{h}}}_1,\ldots,{{\mathbf{h}}}_T)</math> and output vector sequence <math>{{\boldsymbol {{\mathbf{y}}}}}= ({{\mathbf{y}}}_1,\ldots,{{\mathbf{y}}}_T)</math> from an input vector sequence <math>{{\boldsymbol {{\mathbf{x}}}}}= ({{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_T)</math> through the following equation where the index is from <math>t=1</math> to <math>T</math>:<br />
<br />
<math>{{\mathbf{h}}}_t = \begin{cases}<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{\mathbf{b_{h}}}}}\right) &\quad t = 1\\<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{{\mathbf{W}}}_{h h}}}{{\mathbf{h}}}_{t-1} + {{{\mathbf{b_{h}}}}}\right) &\quad \text{else}<br />
\end{cases}</math><br />
<br />
and<br />
<br />
<math>{{\mathbf{y}}}_t = {{{{\mathbf{W}}}_{h y}}}{{\mathbf{h}}}_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The <math>{{\mathbf{W}}}</math> terms are the parameter matrices with subscripts denoting the layer location (<span>e.g. </span><math>{{{{\mathbf{W}}}_{x h}}}</math> is the input-hidden weight matrix), and the offset <math>b</math> terms are bias vectors with appropriate subscripts (<span>e.g. </span><math>{{{\mathbf{b_{h}}}}}</math> is the hidden bias vector). The function <math>{\mathcal{H}}</math> is an elementwise vector function with a range of <math>[0,1]</math> for each component in the hidden layer.<br />
<br />
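A minimal NumPy sketch of this single-layer recursion is given below. It assumes the logistic sigmoid for <math>{\mathcal{H}}</math> and uses a zero initial hidden state, which reproduces the <math>t = 1</math> case since the recurrent term then vanishes; the function name <code>rnn_forward</code> and the toy sizes are hypothetical.<br />
<br />
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_xh, W_hh, b_h, W_hy, b_y):
    """Run a 1-layer RNN over xs of shape (T, input_dim)."""
    h = np.zeros(W_hh.shape[0])   # h_0 = 0, so the first step has no recurrent term
    hs, ys = [], []
    for x_t in xs:
        h = sigmoid(W_xh @ x_t + W_hh @ h + b_h)   # hidden vector h_t
        hs.append(h)
        ys.append(W_hy @ h + b_y)                  # output vector y_t
    return np.stack(hs), np.stack(ys)

# Toy usage with hypothetical sizes
rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2
xs = rng.normal(size=(T, input_dim))
W_xh = rng.uniform(-0.1, 0.1, size=(hidden_dim, input_dim))
W_hh = rng.uniform(-0.1, 0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
W_hy = rng.uniform(-0.1, 0.1, size=(output_dim, hidden_dim))
b_y = np.zeros(output_dim)
hs, ys = rnn_forward(xs, W_xh, W_hh, b_h, W_hy, b_y)
```
<br />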
This paper considers multilayer RNN architectures, with the same hidden layer function used for all <math>N</math> layers. In this model, the hidden vector in the <math>n</math>th layer, <math>{\boldsymbol h}^n</math>, is generated by the rule<br />
<br />
<math>{{\mathbf{h}}}^n_t = {\mathcal{H}}\left({{\mathbf{W}}}_{h^{n-1}h^{n}} {{\mathbf{h}}}^{n-1}_t +<br />
{{\mathbf{W}}}_{h^{n}h^{n}} {{\mathbf{h}}}^n_{t-1} + {{{\mathbf{b_{h}}}}}^n \right),</math><br />
<br />
where <math>{\boldsymbol h}^0 = {{\boldsymbol {{\mathbf{x}}}}}</math>. The final network output vector in the <math>t</math>th step of the output sequence, <math>{{\mathbf{y}}}_t</math>, is<br />
<br />
<math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{h^N y}} {{\mathbf{h}}}^N_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
This is pictured in the figure below for an arbitrary layer and time step.<br />
[[File:rnn_graves.png | frame | center |Fig 1. Schematic of a Recurrent Neural Network at an arbitrary layer and time step. ]]<br />
<br />
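The stacking rule for the multilayer network can be sketched as a loop over layers, where the full hidden sequence of layer <math>n-1</math> becomes the input sequence of layer <math>n</math>. As before, this is an illustrative sketch only: the sigmoid activation, zero initial states, the function name <code>deep_rnn_forward</code> and the toy sizes are assumptions.<br />
<br />
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_rnn_forward(xs, layers, W_hy, b_y):
    """layers: list of (W_in, W_rec, b) tuples, one per layer n = 1..N."""
    seq = xs                                   # h^0 = x
    for W_in, W_rec, b in layers:
        h = np.zeros(W_rec.shape[0])           # assumed zero initial state
        out = []
        for v_t in seq:                        # v_t plays the role of h^{n-1}_t
            h = sigmoid(W_in @ v_t + W_rec @ h + b)
            out.append(h)
        seq = np.stack(out)                    # h^n for all t, fed to layer n + 1
    return seq @ W_hy.T + b_y                  # y_t computed from the top layer h^N_t

# Toy usage with hypothetical sizes
rng = np.random.default_rng(0)
T, n_in, n_hid, n_out, N = 5, 3, 4, 2, 3
xs = rng.normal(size=(T, n_in))
layers = []
for n in range(N):
    fan_in = n_in if n == 0 else n_hid
    layers.append((rng.uniform(-0.1, 0.1, (n_hid, fan_in)),
                   rng.uniform(-0.1, 0.1, (n_hid, n_hid)),
                   np.zeros(n_hid)))
W_hy = rng.uniform(-0.1, 0.1, (n_out, n_hid))
ys = deep_rnn_forward(xs, layers, W_hy, np.zeros(n_out))
```
<br />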
== Long Short-term Memory Architecture ==<br />
<br />
Graves et al. consider a Long Short-Term Memory (LSTM) architecture from Gers et al. This model replaces <math>\mathcal{H}(\cdot)</math> by a composite function that incurs additional parameter matrices, and hence a higher dimensional model. Each neuron in the network (<span>i.e. </span> row of a parameter matrix <math>{{\mathbf{W}}}</math>) has an associated state vector <math>{{\mathbf{c}}}_t</math> at step <math>t</math>, which is a function of the previous <math>{{\mathbf{c}}}_{t-1}</math>, the input <math>{{\mathbf{x}}}_t</math> at step <math>t</math>, and the previous step’s hidden state <math>{{\mathbf{h}}}_{t-1}</math> as<br />
<br />
<math>{{\mathbf{c}}}_t = {{\mathbf{f}}}_t \circ {{\mathbf{c}}}_{t-1} + {{\mathbf{i}}}_t \circ \tanh<br />
\left({{{\mathbf{W}}}_{x {{\mathbf{c}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{c}}}}} {{\mathbf{h}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{c}}}\right)</math><br />
<br />
where <math>\circ</math> denotes the Hadamard product (elementwise vector multiplication), the vector <math>{{\mathbf{i}}}_t</math> denotes the so-called ''input'' vector to the cell, which is generated by the rule<br />
<br />
<math>{{\mathbf{i}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{i}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{i}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{i}}}\right),</math><br />
<br />
and <math>{{\mathbf{f}}}_t</math> is the ''forget gate'' vector, which is given by<br />
<br />
<math>{{\mathbf{f}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{f}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{f}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{f}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{f}}}\right)</math><br />
<br />
Each <math>{{\mathbf{W}}}</math> matrix and bias vector <math>{{\mathbf{b}}}</math> is a free parameter in the model and must be trained. Since <math>{{\mathbf{f}}}_t</math> multiplies the previous state <math>{{\mathbf{c}}}_{t-1}</math> in a Hadamard product with each element in the range <math>[0,1]</math>, it can be understood to reduce or dampen the effect of <math>{{\mathbf{c}}}_{t-1}</math> relative to the new input <math>{{\mathbf{i}}}_t</math>. The final hidden output state is then<br />
<br />
<math>{{\mathbf{h}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{o}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{o}}}}} {{\mathbf{h}}}_{t-1}<br />
+ {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{o}}}}} {{\mathbf{c}}}_{t} + {{\mathbf{b}}}_{{\mathbf{o}}}\right)\circ \tanh({{\mathbf{c}}}_t)</math><br />
<br />
In all of these equations, <math>\sigma</math> denotes the logistic sigmoid function. Note furthermore that <math>{{\mathbf{i}}}</math>, <math>{{\mathbf{f}}}</math>, <math>{{\mathbf{o}}}</math> and <math>{{\mathbf{c}}}</math> are all of the same dimension as the hidden vector <math>h</math>. In addition, the weight matrices from the cell to gate vectors (<span>e.g. </span><math>{{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}</math>) are ''diagonal'', such that each parameter matrix is merely a scaling matrix.<br />
<br />
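One step of the LSTM equations above can be sketched as follows. Because the cell-to-gate matrices are diagonal, they are stored here as vectors and applied with elementwise multiplication. The dictionary layout of the parameters, the key names, the function name <code>lstm_step</code> and the toy sizes are hypothetical choices for illustration.<br />
<br />
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps names to weights, with diagonal W_c* stored as vectors w_c*."""
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    # The output gate uses the updated cell state c_t, as in the equation for h_t
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])
    h = o * np.tanh(c)
    return h, c

# Toy usage with hypothetical sizes
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
p = {}
for g in "ifco":
    p[f"W_x{g}"] = rng.uniform(-0.1, 0.1, (n_hid, n_in))
    p[f"W_h{g}"] = rng.uniform(-0.1, 0.1, (n_hid, n_hid))
    p[f"b_{g}"] = np.zeros(n_hid)
for g in "ifo":
    p[f"w_c{g}"] = rng.uniform(-0.1, 0.1, n_hid)   # diagonal peephole weights
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), p)
```
<br />
Since the output gate lies in <math>(0,1)</math> and <math>\tanh</math> lies in <math>(-1,1)</math>, every component of the hidden vector has magnitude below 1.<br />
<br />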
== Bidirectional RNNs ==<br />
<br />
A bidirectional RNN adds another layer of complexity by computing 2 hidden vectors per layer. Neglecting the <math>n</math> superscripts for the layer index, the ''forward'' hidden vector is determined through the conventional recursion as<br />
<br />
<math>{\overrightarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overrightarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}{\overrightarrow{{{\mathbf{h}}}}}}} {\overrightarrow{{{\mathbf{h}}}}}_{t-1} + {{\mathbf{b_{{\overrightarrow{{{\mathbf{h}}}}}}}}} \right),</math><br />
<br />
while the ''backward'' hidden state is determined recursively from the ''reversed'' sequence <math>({{\mathbf{x}}}_T,\ldots,{{\mathbf{x}}}_1)</math> as<br />
<br />
<math>{\overleftarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overleftarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}{\overleftarrow{{{\mathbf{h}}}}}}} {\overleftarrow{{{\mathbf{h}}}}}_{t+1} + {{\mathbf{b_{{\overleftarrow{{{\mathbf{h}}}}}}}}}\right).</math><br />
<br />
The final output for the single layer state is then an affine transformation of <math>{\overrightarrow{{{\mathbf{h}}}}}_t</math> and <math>{\overleftarrow{{{\mathbf{h}}}}}_t</math> as <math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}y}} {\overrightarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}y}} {\overleftarrow{{{\mathbf{h}}}}}_t<br />
+ {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The combination of LSTM with bidirectional dependence has been previously used by Graves and Schmidhuber, and is extended to multilayer networks in this paper. The motivation for this is to use dependencies on both prior and posterior vectors in the sequence to predict a given output at any time step. In other words, a forward and backward context is used.<br />
<br />
= Network Training for Phoneme Recognition =<br />
<br />
This section describes the phoneme classification experiments performed with the TIMIT corpus. An overview of the input timeseries audio data preprocessing into frequency domain vectors is given, and the optimization techniques are described.<br />
<br />
== Frequency Domain Processing ==<br />
<br />
Recall that for a real, periodic signal <math>{f(t)}</math>, the Fourier transform<br />
<br />
<math>{F(\omega)}= \int_{-\infty}^{\infty}e^{-i\omega t} {f(t)}dt</math><br />
<br />
can be represented for discrete samples <math>{f_0, f_1, \cdots<br />
f_{N-1}}</math> as<br />
<br />
<math>F_k\ = \sum_{n=0}^{N-1} f_n \cdot e^{-i \frac{2 \pi k n}{N}}</math>,<br />
<br />
where <math>{F_k}</math> are the discrete coefficients of the (amplitude) spectral distribution of the signal <math>f</math> in the frequency domain. This is a particularly powerful representation of audio data, since the modulation produced by the tongue and lips when shifting position while the larynx vibrates induces the frequency changes that make up a phonetic alphabet. A particular example is the spectrogram below. A spectrogram is a heat map representation of a matrix of data in which each pixel has intensity proportional to the magnitude of the matrix entry at that location. In this case, the matrix is a windowed Fourier transform of an audio signal spectrum such that the frequency of the audio signal as a function of time can be observed. This spectrogram shows the frequencies of my voice while singing the first bar of ''Hey Jude''; the bright pixels below 200 Hz show the base note, while the fainter lines at integer multiples of the base notes show resonant harmonics.<br />
<br />
[[File:spect.png | frame | center | Spectrogram of the first bar of Hey Jude, showing the frequency amplitude coefficients changing over time as the intensity of the pixels in the heat map.]]<br />
<br />
== Input Vector Format ==<br />
<br />
For each audio waveform in the TIMIT dataset, the Fourier coefficients were computed with a sliding Discrete Fourier Transform (DFT). The window duration used was 10 ms, corresponding to <math>n_s =<br />
80</math> samples per DFT since each waveform in the corpus was digitally registered with a sampling frequency of <math>f_s = 16</math> kHz, producing 40 unique coefficients at each timestep <math>t</math>, <math>\{c^{[t]}_k\}_{k=1}^{40}</math>. In addition, the first and second time derivatives of the coefficients between adjacent DFT windows were computed (the methodology is unspecified, however most likely this was performed with a numerical central difference technique). Thus, the input vector to the network at step <math>t</math> was the concatenated vector<br />
<br />
<math>{{\mathbf{x}}}_t = [c^{[t]}_1, \frac{d}{dt}c^{[t]}_1,<br />
\frac{d^2}{dt^2}c^{[t]}_1, c_2^{[t]}, \frac{d}{dt}c^{[t]}_2,<br />
\frac{d^2}{dt^2}c^{[t]}_2 \ldots]^T.</math><br />
<br />
Finally, an additional preprocessing step was performed: each input vector was normalized such that the dataset had zero mean and unit variance.<br />
<br />
== RNN Transducer ==<br />
<br />
When building a speech recognition classifier it is important to note that the input and output sequences have different lengths (sound data to phonemes). Additionally, RNNs require segmented input data. One approach to solving both of these problems is to align the output (label) to the input (sound data), but more often than not an aligned dataset is not available. In this paper, the Connectionist Temporal Classification (CTC) method is used to define a probability distribution over output sequences given an input sequence. This is augmented with an RNN that predicts each phoneme given the previous phonemes. The two predictions are then combined by a feed-forward network. The authors call this approach an RNN Transducer. From the distribution of the RNN and CTC, a maximum likelihood decoding for a given input can be computed to find the corresponding output label.<br />
<br />
<math>h(x) = \arg \max_{l \in L^{\leq T}} P(l | x)</math><br />
<br />
Where:<br />
<br />
* <math>h(x)</math>: classifier<br />
* <math>x</math>: input sequence<br />
* <math>l</math>: label<br />
* <math>L</math>: alphabet<br />
* <math>T</math>: maximum sequence length<br />
* <math>P(l | x)</math>: probability distribution of <math>l</math> given <math>x</math><br />
<br />
The value for <math>h(x)</math> cannot be computed directly; it is approximated with methods such as Best Path and Prefix Search Decoding. The authors chose to use a graph search algorithm called Beam Search.<br />
<br />
== Network Output Layer ==<br />
<br />
Two different network output layers were used, however most experimental results were reported for a simple softmax probability distribution vector over the set of <math>K =<br />
62</math> symbols, corresponding to the 61 phonemes in the corpus and an additional ''null'' symbol indicating that no phoneme distinct from the previous one was detected. This model is referred to as a Connectionist Temporal Classification (CTC) output function. The other (more complicated) output layer was not rigorously compared with a softmax output, and had nearly identical performance; this summary defers a description of this method, a so-called ''RNN transducer'' to the original paper.<br />
<br />
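The softmax output described above turns the network's output activations at each time step into a probability distribution over the 62 symbols. A minimal sketch, assuming the activations are stored as a (T, 62) array and using the usual max-subtraction for numerical stability; the function name <code>softmax</code> and the random activations are hypothetical.<br />
<br />
```python
import numpy as np

def softmax(a):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
activations = rng.normal(size=(7, 62))   # 7 time steps, K = 62 symbols
probs = softmax(activations)             # each row is a distribution over symbols
```
<br />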
== Network Training Procedure ==<br />
<br />
The parameters in all ANNs were determined using Stochastic Gradient Descent with a fixed update step size (learning rate) of <math>10^{-4}</math> and a Nesterov momentum term of 0.9. The initial parameters were uniformly randomly drawn from <math>[-0.1,0.1]</math>. The optimization procedure was initially run with data instances from the standard 462 speaker training set of the TIMIT corpus. As a stopping criterion for the training, a secondary testing subset of 50 speakers was used on which the phoneme error rate (PER) was computed in each iteration of the optimization algorithm. The initial training phase for each network was halted once the PER stopped decreasing on this testing subset; using the parameters at this point as the initial weights, the optimization procedure was then re-run with Gaussian noise with zero mean and <math>\sigma = 0.075</math> added element-wise to the parameters for each input sequence <math>({{\mathbf{x}}}_1,\ldots, {{\mathbf{x}}}_T)</math> as a form of regularization. The second optimization procedure was again halted once the PER stopped decreasing on the testing dataset. Multiple trials in each of these numerical experiments were not performed, and as such, the variability in performance due to the initial values of the parameters in the optimization routine is unknown.<br />
<br />
= TIMIT Corpus Experiments &amp; Results =<br />
<br />
== Numerical Experiments ==<br />
<br />
To investigate the performance of the Bidirectional LSTM architecture as a function of depth, numerical experiments were conducted with networks with <math>N \in \{1,2,3,5\}</math> layers and 250 hidden units per layer. These are denoted in the paper by the network names CTC-<math>N</math>L-250H (where <math>N</math> is the layer depth), and are summarized with the number of free model parameters in the table below.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|}<br />
<br />
Additional experiments included: a 1-layer model with 3.8M weights, a 3-layer bidirectional ANN with <math>\tanh</math> activation functions rather than LSTM, and a 3-layer unidirectional LSTM model with 3.8M weights (the same number of free parameters as the bidirectional 3-layer LSTM model). Finally, two experiments were performed with a bidirectional LSTM model with 3 hidden layers, each with 250 hidden units, and an RNN transducer output function. One of these experiments used uniformly randomly initialized parameters, and the other used the final (hidden) parameter weights from the CTC-3L-250H model as the initial parameter values in the optimization algorithm. The names of these experiments are summarized below, where TRANS and PRETRANS denote the RNN transducer experiments initialized randomly and using (pretrained) parameters from the CTC-3L-250H model, respectively. The suffixes UNI and TANH denote the unidirectional and <math>\tanh</math> networks, respectively.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|-<br />
|PreTrans-3l-250h<br />
|4.3M<br />
|}<br />
<br />
== Results ==<br />
<br />
The percentage phoneme error rates and number of epochs in the SGD optimization procedure for the LSTM experiments on the TIMIT dataset with varying network depth are shown below. The PER decreases monotonically with depth, but the difference between 3 and 5 layers is negligible; it is possible that the 0.2% difference is within statistical fluctuations induced by the SGD optimization routine and initial parameter values. Note that the allocation of the epochs into either the initial training without noise or the second optimization routine with Gaussian noise added (or both) is unspecified in the paper.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|82<br />
|23.9%<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|55<br />
|21.0%<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|124<br />
|18.6%<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|150<br />
|18.4%<br />
|}<br />
<br />
The second set of PER results is shown below. The unidirectional LSTM architecture CTC-3L-421H-UNI achieves an error rate that is greater than the CTC-3L-250H model by 1 percentage point. No further comparative experiments between unidirectional and bidirectional models were given, however, and the margin of statistical uncertainty is unknown; thus the 1% (absolute) difference may or may not be significant. The TRANS-3L-250H model achieves a nearly identical PER to the CTC softmax model (a 0.3% difference); note, however, that it has 0.5M ''more'' parameters due to the additional classification network at the output, so the comparison is not entirely fair since it has a greater dimensionality. The pretrained model PRETRANS-3L-250H also has 4.3M parameters and sees the best performance with a 17.7% error rate. Note that the difference in training of these two RNN transducer models is primarily in their initialization: the PRETRANS model was initialized using the trained weights of the CTC-3L-250H model (for the hidden layers). Thus, this difference in error rate of 0.6% is the direct result of a different starting iterate in the optimization procedure, which must be kept in mind when comparing between models.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|87<br />
|23.0%<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|107<br />
|37.6%<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|115<br />
|19.6%<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|112<br />
|18.3%<br />
|-<br />
|'''PreTrans-3l-250h'''<br />
|'''4.3M'''<br />
|'''144'''<br />
|'''17.7%'''<br />
|}<br />
<br />
= References =<br />
<br />
A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in <span>''Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on''</span>, pp. 6645–6649, IEEE, 2013.<br />
<br />
C. Lopes and F. Perdig<span>ã</span>o, “Phone recognition on the timit database,” <span>''Speech Technologies/Book''</span>, vol. 1, pp. 285–302, 2011.<br />
<br />
A.-r. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” <span>''Audio, Speech, and Language Processing, IEEE Transactions on''</span>, vol. 20, no. 1, pp. 14–22, 2012.<br />
<br />
F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning precise timing with lstm recurrent networks,” <span>''The Journal of Machine Learning Research''</span>, vol. 3, pp. 115–143, 2003.<br />
<br />
A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm networks,” in <span>''Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on''</span>, vol. 4, pp. 2047–2052, IEEE, 2005.</div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=graves_et_al.,_Speech_recognition_with_deep_recurrent_neural_networks&diff=26098graves et al., Speech recognition with deep recurrent neural networks2015-11-10T17:41:53Z<p>Arashwan: /* Motivation */</p>
<hr />
<div>= Overview =<br />
<br />
This document is a summary of the paper ''Speech recognition with deep recurrent neural networks'' by A. Graves, A.-r. Mohamed, and G. Hinton, which appeared in the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The first and third authors are Artificial Neural Network (ANN) researchers, while Mohamed works in the field of automatic speech recognition.<br />
<br />
The paper presents the application of bidirectional multilayer Long Short-term Memory (LSTM) ANNs with 1–5 layers to phoneme recognition on the TIMIT acoustic phoneme corpus, which is the standard benchmark in the field of acoustic recognition, extending the previous work by Mohamed and Hinton on this topic using Deep Belief Networks. The TIMIT corpus contains audio recordings of 6300 sentences spoken by 630 (American) English speakers from 8 regions with distinct dialects, where each recording has accompanying manually labelled transcriptions of the phonemes in the audio clips alongside timestamp information. The empirical classification accuracies reported in the literature before the publication of this paper are shown in the timeline below (note that in this figure, the accuracy metric is 100% - PER, where PER is the phoneme classification error rate).<br />
<br />
The deep LSTM networks presented with 3 or more layers obtain phoneme classification error rates of 19.6% or less, with one model obtaining 17.7%, which was the best result reported in the literature at the time, outperforming the previous record of 20.7% achieved by Mohamed et al. Furthermore, the error rate decreases monotonically with LSTM network depth for 1–5 layers. While the bidirectional LSTM model performs well on the TIMIT corpus, any potential advantage of bidirectional over unidirectional LSTM network models cannot be determined from this paper, since the performance comparison is across different numbers of iterations taken in the optimization algorithm used to train the models, and multiple trials for statistical validity were not performed.<br />
<br />
<br />
[[File:timit.png | frame | center |Timeline of percentage phoneme recognition accuracy achieved on the core TIMIT corpus, from Lopes and Perdigao, 2011. ]]<br />
<br />
== Motivation ==<br />
Neural networks have been trained for speech recognition problems before, but usually in combination with hidden Markov models. The authors argue that, since speech is an inherently dynamic process, an RNN should be the ideal choice for such a problem. There have been attempts to train RNNs for speech recognition <ref>A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in ICML, Pittsburgh, USA, 2006.</ref> <ref> A. Graves, Supervised sequence labelling with recurrent neural networks, vol. 385, Springer, 2012.</ref> <ref> A. Graves, “Sequence transduction with recurrent neural networks,” in ICML Representation Learning Workshop, 2012.</ref> and RNNs with LSTM for recognizing cursive handwriting <ref> A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, “Unconstrained Online Handwriting Recognition with Recurrent Neural Networks,” in NIPS, 2008.</ref>, but neither has made an impact on speech recognition. The authors drew inspiration from Convolutional Neural Networks, where multiple layers are stacked on top of each other, to combine LSTM and RNNs together.<br />
<br />
However, instead of using a conventional RNN, which only considers previous context, a Bidirectional RNN <ref> M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,” IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.</ref> was used to consider both forward and backward contexts. This is in part because the authors saw no reason not to exploit future context, since speech utterances are transcribed all at once. Additionally, a BRNN has the benefit of being able to consider the entire forward and backward context, not just some predefined window of it.<br />
<br />
[[File:brnn.png|center|600px]]<br />
<br />
= Deep RNN models considered by Graves et al. =<br />
<br />
In this paper, Graves et al. use deep LSTM network models. We briefly review recurrent neural networks, which form the basis of the more complicated LSTM network that has composite <math>\mathcal{H}</math> functions instead of sigmoids and additional parameter vectors associated with the ''state'' of each neuron. Finally, a description is given of the ''bidirectional'' ANNs that are used throughout the numerical experiments.<br />
<br />
== Recurrent Neural Networks ==<br />
<br />
Recall that a standard 1-layer recurrent neural network (RNN) computes the hidden vector sequence <math>{\boldsymbol h} = ({{\mathbf{h}}}_1,\ldots,{{\mathbf{h}}}_T)</math> and output vector sequence <math>{{\boldsymbol {{\mathbf{y}}}}}= ({{\mathbf{y}}}_1,\ldots,{{\mathbf{y}}}_T)</math> from an input vector sequence <math>{{\boldsymbol {{\mathbf{x}}}}}= ({{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_T)</math> through the following equation where the index is from <math>t=1</math> to <math>T</math>:<br />
<br />
<math>{{\mathbf{h}}}_t = \begin{cases}<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{\mathbf{b_{h}}}}}\right) &\quad t = 1\\<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{{\mathbf{W}}}_{h h}}}{{\mathbf{h}}}_{t-1} + {{{\mathbf{b_{h}}}}}\right) &\quad \text{else}<br />
\end{cases}</math><br />
<br />
and<br />
<br />
<math>{{\mathbf{y}}}_t = {{{{\mathbf{W}}}_{h y}}}{{\mathbf{h}}}_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The <math>{{\mathbf{W}}}</math> terms are the parameter matrices with subscripts denoting the layer location (<span>e.g. </span><math>{{{{\mathbf{W}}}_{x h}}}</math> is the input-hidden weight matrix), and the offset <math>b</math> terms are bias vectors with appropriate subscripts (<span>e.g. </span><math>{{{\mathbf{b_{h}}}}}</math> is the hidden bias vector). The function <math>{\mathcal{H}}</math> is an elementwise vector function with a range of <math>[0,1]</math> for each component in the hidden layer.<br />
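<br />
The recursion above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: <math>{\mathcal{H}}</math> is taken to be the logistic sigmoid, the layer sizes and random parameters are arbitrary, and the function name <code>rnn_forward</code> is hypothetical rather than anything from the paper. Starting from a zero hidden state makes the <math>t = 1</math> case coincide with the general one.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y, H=sigmoid):
    """Run a single-layer RNN over xs of shape (T, input_dim).

    Returns the hidden sequence (T, hidden_dim) and output sequence (T, output_dim).
    """
    h = np.zeros(W_hh.shape[0])  # h_0 = 0, so the recurrent term vanishes at t = 1
    hs, ys = [], []
    for x_t in xs:
        h = H(W_xh @ x_t + W_hh @ h + b_h)   # hidden update
        hs.append(h)
        ys.append(W_hy @ h + b_y)            # affine read-out
    return np.array(hs), np.array(ys)

# Toy dimensions (assumed for illustration): 3 inputs, 4 hidden units, 2 outputs, T = 5.
rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))
W_xh = rng.uniform(-0.1, 0.1, (4, 3))
W_hh = rng.uniform(-0.1, 0.1, (4, 4))
W_hy = rng.uniform(-0.1, 0.1, (2, 4))
b_h, b_y = np.zeros(4), np.zeros(2)
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
```

Stacking several such layers, with the hidden sequence of one layer fed in as the input sequence of the next, gives the multilayer form described next.<br />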
<br />
This paper considers multilayer RNN architectures, with the same hidden layer function used for all <math>N</math> layers. In this model, the hidden vector in the <math>n</math>th layer, <math>{\boldsymbol h}^n</math>, is generated by the rule<br />
<br />
<math>{{\mathbf{h}}}^n_t = {\mathcal{H}}\left({{\mathbf{W}}}_{h^{n-1}h^{n}} {{\mathbf{h}}}^{n-1}_t +<br />
{{\mathbf{W}}}_{h^{n}h^{n}} {{\mathbf{h}}}^n_{t-1} + {{{\mathbf{b_{h}}}}}^n \right),</math><br />
<br />
where <math>{\boldsymbol h}^0 = {{\boldsymbol {{\mathbf{x}}}}}</math>. The final network output vector in the <math>t</math>th step of the output sequence, <math>{{\mathbf{y}}}_t</math>, is<br />
<br />
<math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{h^N y}} {{\mathbf{h}}}^N_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
This is pictured in the figure below for an arbitrary layer and time step.<br />
[[File:rnn_graves.png | frame | center |Fig 1. Schematic of a Recurrent Neural Network at an arbitrary layer and time step. ]]<br />
<br />
== Long Short-term Memory Architecture ==<br />
<br />
Graves et al. consider a Long Short-Term Memory (LSTM) architecture from Gers et al. This model replaces <math>\mathcal{H}(\cdot)</math> by a composite function that introduces additional parameter matrices, and hence a higher dimensional model. Each neuron in the network (<span>i.e. </span> each row of a parameter matrix <math>{{\mathbf{W}}}</math>) has an associated state vector <math>{{\mathbf{c}}}_t</math> at step <math>t</math>, which is a function of the previous <math>{{\mathbf{c}}}_{t-1}</math>, the input <math>{{\mathbf{x}}}_t</math> at step <math>t</math>, and the previous step’s hidden state <math>{{\mathbf{h}}}_{t-1}</math> as<br />
<br />
<math>{{\mathbf{c}}}_t = {{\mathbf{f}}}_t \circ {{\mathbf{c}}}_{t-1} + {{\mathbf{i}}}_t \circ \tanh<br />
\left({{{\mathbf{W}}}_{x {{\mathbf{c}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{c}}}}} {{\mathbf{h}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{c}}}\right)</math><br />
<br />
where <math>\circ</math> denotes the Hadamard product (elementwise vector multiplication), and the vector <math>{{\mathbf{i}}}_t</math> denotes the so-called ''input gate'' vector to the cell, which is generated by the rule<br />
<br />
<math>{{\mathbf{i}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{i}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{i}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{i}}}\right),</math><br />
<br />
and <math>{{\mathbf{f}}}_t</math> is the ''forget gate'' vector, which is given by<br />
<br />
<math>{{\mathbf{f}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{f}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{f}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{f}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{f}}}\right)</math><br />
<br />
Each <math>{{\mathbf{W}}}</math> matrix and bias vector <math>{{\mathbf{b}}}</math> is a free parameter in the model and must be trained. Since <math>{{\mathbf{f}}}_t</math> multiplies the previous state <math>{{\mathbf{c}}}_{t-1}</math> in a Hadamard product with each element in the range <math>[0,1]</math>, it can be understood to reduce or dampen the effect of <math>{{\mathbf{c}}}_{t-1}</math> relative to the new input <math>{{\mathbf{i}}}_t</math>. The final hidden output state is then<br />
<br />
<math>{{\mathbf{h}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{o}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{o}}}}} {{\mathbf{h}}}_{t-1}<br />
+ {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{o}}}}} {{\mathbf{c}}}_{t} + {{\mathbf{b}}}_{{\mathbf{o}}}\right)\circ \tanh({{\mathbf{c}}}_t)</math><br />
<br />
In all of these equations, <math>\sigma</math> denotes the logistic sigmoid function. Note furthermore that <math>{{\mathbf{i}}}</math>, <math>{{\mathbf{f}}}</math>, <math>{{\mathbf{o}}}</math> and <math>{{\mathbf{c}}}</math> are all of the same dimension as the hidden vector <math>h</math>. In addition, the weight matrices from the cell to gate vectors (<span>e.g. </span><math>{{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}</math>) are ''diagonal'', such that each parameter matrix is merely a scaling matrix.<br />
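<br />
A single LSTM step following the equations of this section might look as below. This is a sketch, not the authors' implementation: the diagonal cell-to-gate matrices are stored as vectors (named <code>w_ci</code>, <code>w_cf</code>, <code>w_co</code> here), the sigmoid factor in the final equation is called <code>o_t</code>, and the dictionary layout and helper <code>init_params</code> are assumptions made for the example.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with diagonal (peephole) cell-to-gate weights."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    # The output gate looks at the *updated* cell state c_t, as in the equation for h_t.
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c_t + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def init_params(n_in, n_hid, rng):
    """Hypothetical helper: uniform [-0.1, 0.1] weights, zero biases."""
    p = {}
    for g in "ifco":                      # input gate, forget gate, cell input, output gate
        p[f"W_x{g}"] = rng.uniform(-0.1, 0.1, (n_hid, n_in))
        p[f"W_h{g}"] = rng.uniform(-0.1, 0.1, (n_hid, n_hid))
        p[f"b_{g}"] = np.zeros(n_hid)
    for g in "ifo":                       # diagonal cell-to-gate weights, stored as vectors
        p[f"w_c{g}"] = rng.uniform(-0.1, 0.1, n_hid)
    return p

rng = np.random.default_rng(0)
params = init_params(3, 4, rng)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(5, 3)):       # toy sequence of length 5
    h, c = lstm_step(x_t, h, c, params)
```

Storing the cell-to-gate weights as vectors and using elementwise multiplication is equivalent to multiplying by a diagonal matrix, and costs only one parameter per hidden unit per gate.<br />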
<br />
== Bidirectional RNNs ==<br />
<br />
A bidirectional RNN adds another layer of complexity by computing 2 hidden vectors per layer. Neglecting the <math>n</math> superscripts for the layer index, the ''forward'' hidden vector is determined through the conventional recursion as<br />
<br />
<math>{\overrightarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overrightarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}{\overrightarrow{{{\mathbf{h}}}}}}} {\overrightarrow{{{\mathbf{h}}}}}_{t-1} + {{\mathbf{b_{{\overrightarrow{{{\mathbf{h}}}}}}}}} \right),</math><br />
<br />
while the ''backward'' hidden state is determined recursively from the ''reversed'' sequence <math>({{\mathbf{x}}}_T,\ldots,{{\mathbf{x}}}_1)</math> as<br />
<br />
<math>{\overleftarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overleftarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}{\overleftarrow{{{\mathbf{h}}}}}}} {\overleftarrow{{{\mathbf{h}}}}}_{t+1} + {{\mathbf{b_{{\overleftarrow{{{\mathbf{h}}}}}}}}}\right).</math><br />
<br />
The final output for the single layer state is then an affine transformation of <math>{\overrightarrow{{{\mathbf{h}}}}}_t</math> and <math>{\overleftarrow{{{\mathbf{h}}}}}_t</math> as <math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}y}} {\overrightarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}y}} {\overleftarrow{{{\mathbf{h}}}}}_t<br />
+ {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The combination of LSTM with bidirectional dependence has been previously used by Graves and Schmidhuber, and is extended to multilayer networks in this paper. The motivation for this is to use dependencies on both prior and posterior vectors in the sequence to predict a given output at any time step. In other words, a forward and backward context is used.<br />
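<br />
The two recursions and the combined read-out can be sketched as follows. For brevity the sketch assumes <math>{\mathcal{H}} = \tanh</math> rather than the LSTM composite function used in the paper, and the function name, dictionary layout and toy dimensions are illustrative assumptions.<br />

```python
import numpy as np

def birnn_forward(xs, fwd, bwd, W_fy, W_by, b_y, H=np.tanh):
    """Single bidirectional layer: forward pass, backward pass, affine combination."""
    T, n = len(xs), fwd["W_hh"].shape[0]
    h_f, h_b = np.zeros((T, n)), np.zeros((T, n))
    h = np.zeros(n)
    for t in range(T):                     # t = 1, ..., T uses h_{t-1}
        h = H(fwd["W_xh"] @ xs[t] + fwd["W_hh"] @ h + fwd["b_h"])
        h_f[t] = h
    h = np.zeros(n)
    for t in reversed(range(T)):           # t = T, ..., 1 uses h_{t+1}
        h = H(bwd["W_xh"] @ xs[t] + bwd["W_hh"] @ h + bwd["b_h"])
        h_b[t] = h
    ys = h_f @ W_fy.T + h_b @ W_by.T + b_y
    return ys, h_f, h_b

def make_direction(n_in, n_hid, rng):
    return {"W_xh": rng.uniform(-0.5, 0.5, (n_hid, n_in)),
            "W_hh": rng.uniform(-0.5, 0.5, (n_hid, n_hid)),
            "b_h": np.zeros(n_hid)}

rng = np.random.default_rng(0)
xs = rng.normal(size=(4, 3))
fwd, bwd = make_direction(3, 4, rng), make_direction(3, 4, rng)
W_fy, W_by = rng.uniform(-0.5, 0.5, (2, 4)), rng.uniform(-0.5, 0.5, (2, 4))
b_y = np.zeros(2)
ys, h_f, h_b = birnn_forward(xs, fwd, bwd, W_fy, W_by, b_y)

# Perturbing only the last input changes the first output, via the backward state.
xs2 = xs.copy()
xs2[-1] += 1.0
ys2, h_f2, h_b2 = birnn_forward(xs2, fwd, bwd, W_fy, W_by, b_y)
```

The final lines illustrate the point about future context: the forward state at the first step is unaffected by a change to the last input, while the output at the first step is not.<br />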
<br />
= Network Training for Phoneme Recognition =<br />
<br />
This section describes the phoneme classification experiments performed with the TIMIT corpus. An overview of the preprocessing of the input time series audio data into frequency domain vectors is given, and the optimization techniques are described.<br />
<br />
== Frequency Domain Processing ==<br />
<br />
Recall that for a real, periodic signal <math>{f(t)}</math>, the Fourier transform<br />
<br />
<math>{F(\omega)}= \int_{-\infty}^{\infty}e^{-i\omega t} {f(t)}dt</math><br />
<br />
can be represented for discrete samples <math>{f_0, f_1, \cdots<br />
f_{N-1}}</math> as<br />
<br />
<math>F_k\ = \sum_{n=0}^{N-1} f_n \cdot e^{-i \frac{2 \pi k n}{N}}</math>,<br />
<br />
where <math>{F_k}</math> are the discrete coefficients of the (amplitude) spectral distribution of the signal <math>f</math> in the frequency domain. This is a particularly powerful representation of audio data, since the modulation produced by the tongue and lips when shifting position while the larynx vibrates induces the frequency changes that make up a phonetic alphabet. A particular example is the spectrogram below. A spectrogram is a heat map representation of a matrix of data in which each pixel has intensity proportional to the magnitude of the matrix entry at that location. In this case, the matrix is a windowed Fourier transform of an audio signal spectrum such that the frequency of the audio signal as a function of time can be observed. This spectrogram shows the frequencies of my voice while singing the first bar of ''Hey Jude''; the bright pixels below 200 Hz show the base note, while the fainter lines at integer multiples of the base notes show resonant harmonics.<br />
<br />
[[File:spect.png | frame | center | Spectrogram of the first bar of Hey Jude, showing the frequency amplitude coefficients changing over time as the intensity of the pixels in the heat map.]]<br />
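<br />
As a small worked example of the discrete transform, the sketch below evaluates the sum directly using the standard DFT convention (exponent <math>-2\pi i k n / N</math>) and compares it with NumPy's FFT. The test signal, a 1 kHz sine, is an assumption chosen for illustration; the sampling rate and window length follow the values quoted in the next section.<br />

```python
import numpy as np

def dft(f):
    """Direct O(N^2) evaluation of F_k = sum_n f_n * exp(-2*pi*i*k*n/N)."""
    N = len(f)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N) @ f

fs = 16000                                  # sampling frequency in Hz
N = 80                                      # samples per window
t = np.arange(N) / fs
signal = np.sin(2 * np.pi * 1000 * t)       # assumed test tone at 1 kHz

F = dft(signal)
half = np.abs(F[: N // 2])                  # 40 coefficients below the Nyquist frequency
peak_bin = int(np.argmax(half))
peak_hz = peak_bin * fs / N                 # bin spacing is fs / N = 200 Hz
```

For a real signal the coefficients above <math>N/2</math> mirror those below it, which is why only the first half of the spectrum carries independent information. Sliding the window along the waveform and stacking the magnitudes column by column produces a spectrogram like the one shown.<br />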
<br />
== Input Vector Format ==<br />
<br />
For each audio waveform in the TIMIT dataset, the Fourier coefficients were computed with a sliding Discrete Fourier Transform (DFT). The window duration used was 10 ms, corresponding to <math>n_s =<br />
80</math> samples per DFT since each waveform in the corpus was digitally registered with a sampling frequency of <math>f_s = 16</math> kHz, producing 40 unique coefficients at each timestep <math>t</math>, <math>\{c^{[t]}_k\}_{k=1}^{40}</math>. In addition, the first and second time derivatives of the coefficients between adjacent DFT windows were computed (the methodology is unspecified, however most likely this was performed with a numerical central difference technique). Thus, the input vector to the network at step <math>t</math> was the concatenated vector<br />
<br />
<math>{{\mathbf{x}}}_t = [c^{[t]}_1, \frac{d}{dt}c^{[t]}_1,<br />
\frac{d^2}{dt^2}c^{[t]}_1, c_2^{[t]}, \frac{d}{dt}c^{[t]}_2,<br />
\frac{d^2}{dt^2}c^{[t]}_2 \ldots]^T.</math><br />
<br />
Finally, an additional preprocessing step was performed: each input vector was normalized such that the dataset had zero mean and unit variance.<br />
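<br />
One way the input vectors could be assembled is sketched below. Since the paper does not specify how the derivatives were computed, the use of <code>numpy.gradient</code> (central differences in the interior, one-sided differences at the ends) is an assumption, as is the per-component normalization over all frames; the function names are hypothetical.<br />

```python
import numpy as np

def build_inputs(coeffs):
    """coeffs: (T, 40) spectral coefficients -> (T, 120) input vectors.

    Each coefficient is followed by its first and second time derivative,
    i.e. [c_1, c_1', c_1'', c_2, c_2', c_2'', ...].
    """
    d1 = np.gradient(coeffs, axis=0)        # first derivative across adjacent windows
    d2 = np.gradient(d1, axis=0)            # second derivative
    stacked = np.stack([coeffs, d1, d2], axis=2)     # shape (T, 40, 3)
    return stacked.reshape(coeffs.shape[0], -1)      # interleave per coefficient

def normalise(x, eps=1e-8):
    """Zero mean and unit variance for every input component over all frames."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

rng = np.random.default_rng(0)
coeffs = rng.normal(size=(50, 40))          # stand-in for real DFT coefficients
x = build_inputs(coeffs)
z = normalise(x)
```

In practice the mean and standard deviation would be estimated on the training set and reused unchanged for the held-out data.<br />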
<br />
== RNN Transducer ==<br />
<br />
When building a speech recognition classifier it is important to note that the input and output sequences have different lengths (sound data to phonemes). Additionally, RNNs require segmented input data. One approach to solving both of these problems is to align the output (label) to the input (sound data), but more often than not an aligned dataset is not available. In this paper, the Connectionist Temporal Classification (CTC) method is used to define a probability distribution over output sequences given an input sequence. This is augmented with an RNN that predicts phonemes given the previous phonemes. The two predictions are then combined by a feed-forward network. The authors call this approach an RNN Transducer. From the distribution of the RNN and CTC, a maximum likelihood decoding for a given input can be computed to find the corresponding output label.<br />
<br />
<math>h(x) = \arg \max_{l \in L^{\leq T}} P(l | x)</math><br />
<br />
Where:<br />
<br />
* <math>h(x)</math>: classifier<br />
* <math>x</math>: input sequence<br />
* <math>l</math>: label<br />
* <math>L</math>: alphabet<br />
* <math>T</math>: maximum sequence length<br />
* <math>P(l | x)</math>: probability distribution of <math>l</math> given <math>x</math><br />
<br />
The value of <math>h(x)</math> cannot be computed directly; it is approximated with methods such as Best Path and Prefix Search Decoding. The authors chose to use a graph search algorithm called Beam Search.<br />
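<br />
Of the approximations named above, Best Path decoding is the simplest to write down, and a sketch of it is given here. Note that this is not the Beam Search used by the authors; it only illustrates how a frame-level CTC output is turned into a shorter label sequence. The convention that index 0 is the ''null'' symbol, and the toy probabilities, are assumptions made for the example.<br />

```python
def best_path_decode(probs, blank=0):
    """Best Path decoding for CTC outputs.

    probs: one probability vector per frame.
    Takes the most probable symbol in every frame, merges consecutive
    repeats, then drops the blank (null) symbol.
    """
    path = [max(range(len(p)), key=p.__getitem__) for p in probs]
    decoded, prev = [], None
    for s in path:
        if s != prev and s != blank:
            decoded.append(s)
        prev = s
    return decoded

# Seven frames over three symbols (0 = null). The frame-wise argmax is
# 1 1 0 1 2 2 0, so the blank separates the two occurrences of label 1.
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
    [0.9, 0.05, 0.05],
]
labels = best_path_decode(frames)
```

Best Path returns the labelling of the single most probable frame-level path, which is not guaranteed to be the most probable labelling overall; this is one motivation for search-based decoders.<br />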
<br />
== Network Output Layer ==<br />
<br />
Two different network output layers were used, but most experimental results were reported for a simple softmax probability distribution vector over the set of <math>K =<br />
62</math> symbols, corresponding to the 61 phonemes in the corpus and an additional ''null'' symbol indicating that no phoneme distinct from the previous one was detected. This model is referred to as a Connectionist Temporal Classification (CTC) output function. The other (more complicated) output layer was not rigorously compared with a softmax output, and had nearly identical performance; this summary defers a description of this method, a so-called ''RNN transducer'', to the original paper.<br />
<br />
== Network Training Procedure ==<br />
<br />
The parameters in all ANNs were determined using Stochastic Gradient Descent with a fixed update step size (learning rate) of <math>10^{-4}</math> and a Nesterov momentum term of 0.9. The initial parameters were uniformly randomly drawn from <math>[-0.1,0.1]</math>. The optimization procedure was initially run with data instances from the standard 462 speaker training set of the TIMIT corpus. As a stopping criterion for the training, a secondary testing subset of 50 speakers was used on which the phoneme error rate (PER) was computed in each iteration of the optimization algorithm. The initial training phase for each network was halted once the PER stopped decreasing on this held-out subset; using the parameters at this point as the initial weights, the optimization procedure was then re-run with Gaussian noise with zero mean and <math>\sigma = 0.075</math> added element-wise to the parameters for each input sequence <math>({{\mathbf{x}}}_1,\ldots, {{\mathbf{x}}}_T)</math> as a form of regularization. The second optimization procedure was again halted once the PER stopped decreasing on the testing dataset. Multiple trials in each of these numerical experiments were not performed, and as such, the variability in performance due to the initial values of the parameters in the optimization routine is unknown.<br />
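<br />
The update rule described above can be sketched on a toy problem. Everything beyond the hyperparameters quoted in the text is an assumption: the particular Nesterov formulation, the quadratic loss standing in for the network loss, the larger learning rate used so that the toy example converges in a few hundred steps, and the function name <code>sgd_nesterov</code>. Early stopping on the held-out PER is omitted.<br />

```python
import numpy as np

def sgd_nesterov(grad_fn, w0, lr=1e-4, momentum=0.9, noise_std=0.0, steps=1000, seed=0):
    """Momentum SGD with optional Gaussian weight noise.

    The gradient is evaluated at the look-ahead point with fresh zero-mean
    noise added to every parameter; the stored weights themselves stay noise-free.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w + momentum * v
        noisy = lookahead + rng.normal(0.0, noise_std, size=w.shape)
        v = momentum * v - lr * grad_fn(noisy)
        w = w + v
    return w

# Toy loss 0.5 * ||w - target||^2, whose gradient is simply w - target.
target = np.array([1.0, -2.0])
grad = lambda w: w - target

# Phase 1: no noise. Phase 2: restart from the phase-1 weights with sigma = 0.075.
w_clean = sgd_nesterov(grad, [0.0, 0.0], lr=0.05, steps=500)
w_noisy = sgd_nesterov(grad, w_clean, lr=0.05, noise_std=0.075, steps=500)
```

With noise switched on, the iterate no longer settles exactly at the minimum but fluctuates around it, which is the sense in which the weight noise acts as a regularizer.<br />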
<br />
= TIMIT Corpus Experiments &amp; Results =<br />
<br />
== Numerical Experiments ==<br />
<br />
To investigate the performance of the Bidirectional LSTM architecture as a function of depth, numerical experiments were conducted with networks with <math>N \in \{1,2,3,5\}</math> layers and 250 hidden units per layer. These are denoted in the paper by the network names CTC-<math>N</math>L-250H (where <math>N</math> is the layer depth), and are summarized with the number of free model parameters in the table below.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|}<br />
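<br />
The parameter counts in this table can be reproduced approximately with a short calculation. The sketch below makes several assumptions that are not spelled out in this summary: the input has 120 components (40 coefficients with two derivatives each, as in the input format above), each layer above the first receives both the forward and backward hidden vectors of the layer below, every gate has a bias, and the softmax layer has a bias. The function names are hypothetical.<br />

```python
def lstm_layer_params(n_in, n_hid):
    """Free parameters of one unidirectional LSTM layer.

    Four blocks (input gate, forget gate, cell input, output gate), each with
    an input matrix, a recurrent matrix and a bias, plus three diagonal
    cell-to-gate (peephole) vectors.
    """
    return 4 * (n_in * n_hid + n_hid * n_hid + n_hid) + 3 * n_hid

def ctc_blstm_params(n_layers, n_hid, n_in=120, n_out=62):
    """Total parameters of a deep bidirectional LSTM with a softmax output."""
    total, layer_in = 0, n_in
    for _ in range(n_layers):
        total += 2 * lstm_layer_params(layer_in, n_hid)   # forward + backward direction
        layer_in = 2 * n_hid                              # next layer sees both directions
    return total + layer_in * n_out + n_out               # output weights and bias

counts = {n: ctc_blstm_params(n, 250) for n in (1, 2, 3, 5)}
in_millions = {n: round(c / 1e6, 1) for n, c in counts.items()}
```

Under these assumptions the rounded counts agree with the four rows of the table, which suggests the layer wiring assumed here is close to the one used.<br />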
<br />
Additional experiments included: a 1-layer model with 3.8M weights, a 3-layer bidirectional ANN with <math>\tanh</math> activation functions rather than LSTM, and a 3-layer unidirectional LSTM model with 3.8M weights (the same number of free parameters as the bidirectional 3-layer LSTM model). Finally, two experiments were performed with a bidirectional LSTM model with 3 hidden layers, each with 250 hidden units, and an RNN transducer output function. One of these experiments used uniformly randomly initialized parameters, and the other used the final (hidden) parameter weights from the CTC-3L-250H model as the initial parameter values in the optimization algorithm. The names of these experiments are summarized below, where TRANS and PRETRANS denote the RNN transducer experiments initialized randomly and using (pretrained) parameters from the CTC-3L-250H model, respectively. The suffixes UNI and TANH denote the unidirectional and <math>\tanh</math> networks, respectively.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|-<br />
|PreTrans-3l-250h<br />
|4.3M<br />
|}<br />
<br />
== Results ==<br />
<br />
The percentage phoneme error rates and number of epochs in the SGD optimization procedure for the LSTM experiments on the TIMIT dataset with varying network depth are shown below. The PER decreases monotonically with depth, but the difference between 3 and 5 layers is negligible; it is possible that the 0.2% difference is within statistical fluctuations induced by the SGD optimization routine and initial parameter values. Note that the allocation of the epochs into either the initial training without noise or the second optimization routine with Gaussian noise added (or both) is unspecified in the paper.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|82<br />
|23.9%<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|55<br />
|21.0%<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|124<br />
|18.6%<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|150<br />
|18.4%<br />
|}<br />
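<br />
For reference, a phoneme error rate of the kind tabulated above is conventionally computed as the edit distance between the recognized and reference phoneme sequences, divided by the total reference length. The sketch below assumes that convention (the summary does not state how the PER was scored), and the example phoneme strings are made up.<br />

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(refs, hyps):
    """Percentage PER over a set of utterances."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total

# One substitution (ae -> eh) and one deletion (final sil) against 5 reference phonemes.
refs = [["sil", "k", "ae", "t", "sil"]]
hyps = [["sil", "k", "eh", "t"]]
per = phoneme_error_rate(refs, hyps)
```

Because insertions count as errors, a PER computed this way can exceed 100% for very poor hypotheses.<br />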
<br />
The second set of PER results is shown below. The unidirectional LSTM architecture CTC-3L-421H-UNI achieves an error rate that is greater than the CTC-3L-250H model by 1 percentage point. No further comparative experiments between unidirectional and bidirectional models were given, however, and the margin of statistical uncertainty is unknown; thus the 1% (absolute) difference may or may not be significant. The TRANS-3L-250H model achieves a nearly identical PER to the CTC softmax model (a 0.3% difference); note, however, that it has 0.5M ''more'' parameters due to the additional classification network at the output, so the comparison is not entirely fair since it has a greater dimensionality. The pretrained model PRETRANS-3L-250H also has 4.3M parameters and sees the best performance with a 17.7% error rate. Note that the difference in training of these two RNN transducer models is primarily in their initialization: the PRETRANS model was initialized using the trained weights of the CTC-3L-250H model (for the hidden layers). Thus, this difference in error rate of 0.6% is the direct result of a different starting iterate in the optimization procedure, which must be kept in mind when comparing between models.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|87<br />
|23.0%<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|107<br />
|37.6%<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|115<br />
|19.6%<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|112<br />
|18.3%<br />
|-<br />
|'''PreTrans-3l-250h'''<br />
|'''4.3M'''<br />
|'''144'''<br />
|'''17.7%'''<br />
|}<br />
<br />
= References =<br />
<br />
A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in <span>''Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on''</span>, pp. 6645–6649, IEEE, 2013.<br />
<br />
C. Lopes and F. Perdig<span>ã</span>o, “Phone recognition on the timit database,” <span>''Speech Technologies/Book''</span>, vol. 1, pp. 285–302, 2011.<br />
<br />
A.-r. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” <span>''Audio, Speech, and Language Processing, IEEE Transactions on''</span>, vol. 20, no. 1, pp. 14–22, 2012.<br />
<br />
F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning precise timing with lstm recurrent networks,” <span>''The Journal of Machine Learning Research''</span>, vol. 3, pp. 115–143, 2003.<br />
<br />
A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm networks,” in <span>''Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on''</span>, vol. 4, pp. 2047–2052, IEEE, 2005.</div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=26096deep neural networks for acoustic modeling in speech recognition2015-11-10T16:28:21Z<p>Arashwan: /* Fine-Tuning DNNs To Optimize Mutual Information */</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model the temporal variations and Gaussian mixture models (GMMs) to determine the likelihood of each state of each HMM given an acoustic observation. The speech signal in the classical systems is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw speech signal. Although GMMs are quite flexible and are fairly easy to train using the Expectation Maximization (EM) algorithm, they are inefficient when modelling data that lie close to a nonlinear manifold, which is the case for speech data. Deep Neural Networks (DNNs) don't suffer from the same shortcoming, hence they can learn much better models than GMMs. Over the past few years, training DNNs has become possible thanks to the advancements in machine learning and computer hardware, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small and large vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
DNNs are feed-forward neural networks that have multiple hidden layers. The last layer is a softmax layer that gives the class probabilities. The weights of the DNN are learnt using the backpropagation algorithm; it was found empirically that computing the gradient on small random mini-batches is more efficient. To avoid overfitting, early stopping is used: training stops when the accuracy on a validation set starts to decrease. Pretraining is essential when the amount of training data is small. Restricted Boltzmann Machines (RBMs) are used for pretraining, except for the first layer, which uses a Gaussian-Bernoulli RBM (GRBM) since the input is real-valued.<br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
An HMM requires the likelihoods of the observations <math>p(AcousticInput|HMMstate)</math> to run the forward-backward algorithm or to compute a Viterbi alignment. DNNs output the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to scaled likelihoods by dividing them by <math>p(HMMstate)</math>, the frequency of each HMM state in the training data. The conversion from posteriors to likelihoods is important when the training labels are highly unbalanced.<br />
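The conversion described above can be sketched in a few lines of Python. This is only an illustration: the function name <code>scaled_log_likelihoods</code> and the three-state posteriors and priors are hypothetical values chosen here, not numbers from the paper. The division is done in the log domain, as a subtraction.<br />

```python
import math

def scaled_log_likelihoods(log_posteriors, state_priors):
    """Return log p(state | x) - log p(state) for every state.

    By Bayes' rule this equals log p(x | state) up to an additive
    constant that is the same for all states.
    """
    return [lp - math.log(prior)
            for lp, prior in zip(log_posteriors, state_priors)]

# Hypothetical example: state 0 has the largest posterior, but it is also
# by far the most frequent state in the (unbalanced) training alignment.
posteriors = [0.60, 0.25, 0.15]
priors = [0.80, 0.15, 0.05]

scores = scaled_log_likelihoods([math.log(p) for p in posteriors], priors)
best_by_posterior = max(range(3), key=lambda s: posteriors[s])
best_by_likelihood = max(range(3), key=lambda s: scores[s])
```

In this toy case the posterior favours the frequent state 0, while the scaled likelihood favours the rare state 2, which illustrates why the conversion matters for unbalanced labels.<br />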
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. On this task, DNN-HMM systems outperformed the classical GMM-HMM systems. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>, who reported a significant improvement in accuracy over the state-of-the-art GMM-HMM systems at that time. It was found that the structure of the DNN (i.e. the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus more on tuning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they provide uncorrelated features, which is important to avoid using full-covariance GMMs. Some of the acoustic information is lost when MFCCs are used. DNNs, on the other hand, can work with correlated features, which opened the door to using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. It was found that using filter-bank features with DNNs improved the accuracy by 1.7% <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy or the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math> and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language model were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the sequence-level conditional probability <math>p(l_{1:T}|v_{1:T})</math>; this is done for the softmax layer only, keeping the parameters of the hidden layers <math>h</math> fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
where <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes the value one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with that transition feature, and <math>\lambda</math> are the weights of the softmax layer. <math>\gamma</math> and <math>\lambda</math> are tuned using gradient descent. Experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set<br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
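As a rough illustration of the sequence-level objective above, the following Python sketch evaluates <math>\log p(l_{1:T}|h_{1:T})</math> for a toy problem, computing <math>\log Z(h_{1:T})</math> with a forward recursion. Several things here are assumptions made for the sketch and are not taken from the paper: the transition term is simply omitted at the first frame (there is no <math>l_0</math>), all function names are hypothetical, and the values of <code>h</code>, <code>lam</code> and <code>gamma</code> are made up.<br />

```python
import math
from itertools import product

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def frame_scores(h, lam):
    """emit[t][j] = sum_d lam[j][d] * h[t][d]: score of label j at frame t."""
    return [[sum(w * x for w, x in zip(lam_j, h_t)) for lam_j in lam]
            for h_t in h]

def path_score(labels, emit, gamma):
    """Unnormalised log-score of one label sequence.

    Assumption: no transition term is added at the first frame.
    """
    score = emit[0][labels[0]]
    for t in range(1, len(labels)):
        score += gamma[labels[t - 1]][labels[t]] + emit[t][labels[t]]
    return score

def log_partition(emit, gamma):
    """log Z(h_{1:T}), summing over all label sequences by forward recursion."""
    n = len(gamma)
    alpha = list(emit[0])
    for t in range(1, len(emit)):
        alpha = [emit[t][j] + logsumexp([alpha[i] + gamma[i][j] for i in range(n)])
                 for j in range(n)]
    return logsumexp(alpha)

def sequence_log_prob(labels, h, lam, gamma):
    """log p(l_{1:T} | h_{1:T}) for one label sequence."""
    emit = frame_scores(h, lam)
    return path_score(labels, emit, gamma) - log_partition(emit, gamma)

# Toy values (hypothetical): T = 3 frames, D = 2 hidden units, 2 labels.
h = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
lam = [[1.0, -1.0], [-0.5, 0.8]]
gamma = [[0.2, -0.3], [0.1, 0.4]]

# Sanity check: the probabilities of all 2**3 label sequences sum to one.
total = sum(math.exp(sequence_log_prob(list(seq), h, lam, gamma))
            for seq in product(range(2), repeat=3))
```

Fine-tuning would follow the gradient of <code>sequence_log_prob</code> with respect to <code>lam</code> and <code>gamma</code> for the reference label sequence; the gradient computation itself is not shown here.<br />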
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
Convolutional DNNs were introduced in 2009 and applied to various audio tasks, including the TIMIT data set <ref><br />
H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for<br />
audio classification using convolutional deep belief networks,” in Advances in<br />
Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J.<br />
Lafferty, C. K. I. Williams, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2009,<br />
pp. 1096–1104.<br />
</ref>. In that work, the convolution was applied along the temporal dimension in order to extract the same features at different times. Since temporal variation is already handled by the HMM, Abdel-Hamid et al. proposed applying the convolution along the frequency dimension instead <ref name=convDNN>
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>. Weight sharing and max-pooling were used only across nearby frequencies because acoustic features at different frequencies are very different. They achieved a phone error rate of 20% on TIMIT, the lowest reported at that time.<br />
<br />
== DNNs and GMMs ==<br />
As the following table shows, monophone DNNs can outperform the best triphone GMM-HMM by 1.7%. Possible reasons for this success include the following:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, in which each parameter applies to only a small amount of the data.<br />
# A GMM assumes that each data point is generated by a single component, which makes it inefficient at modelling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and to analyse a larger window of the signal at each time step.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phonetic error rate (PER) results on the TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application of recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs On FBank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of the DNN-HMM system on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large tasks and outperformed GMM-HMM systems on every task.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
Bing-Voice-Search is a data set of 24 hrs of speech with different sources of acoustic variation such as noise, music, side-speech, accents, sloppy pronunciation, interruptions, and mobile phone differences. The DNN-HMM system was based on the DNN that worked well for TIMIT; it contained 5 hidden layers of 2048 units each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69.6% compared to 63.8% for the GMM-HMM system. Increasing the training data to 48 hrs further improved the sentence accuracy of the DNN-HMM system to 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
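The 11-frame input window mentioned above can be sketched as follows. This is a minimal illustration, assuming that the window is centred on the frame being classified (5 frames on each side) and that the edges of the utterance are padded by repeating the first or last frame; the padding choice, the function name <code>stack_context</code> and the toy two-dimensional frames are assumptions made here, not details from the paper.<br />

```python
def stack_context(frames, left=5, right=5):
    """Concatenate every frame with its `left` previous and `right` next frames.

    Assumption: indices outside the utterance are clamped, i.e. the first
    or last frame is repeated as padding.
    """
    num_frames = len(frames)
    stacked = []
    for t in range(num_frames):
        window = []
        for k in range(t - left, t + right + 1):
            k = min(max(k, 0), num_frames - 1)  # clamp to valid frame indices
            window.extend(frames[k])
        stacked.append(window)
    return stacked

# Toy utterance (hypothetical): 20 frames with 2 features each.
frames = [[float(t), float(t) + 0.5] for t in range(20)]
inputs = stack_context(frames)  # one 11 * 2 = 22 dimensional vector per frame
```

Each element of <code>inputs</code> is the vector the DNN would see when predicting the HMM state of the middle frame of the window.<br />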
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech data, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN contains 7 hidden layers of 2048 units each. A trigram language model trained on a 2000-hr speech corpus was used.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate on the RT03S-FSH test set from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref><br />
. The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training.<br />
{| class="wikitable"<br />
|+ Comparing five different DBN-DNN acoustic models with two strong GMM-HMM baseline systems that are discriminatively trained. "40 MIX" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
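Since this summary quotes some improvements as absolute and others as relative, the short sketch below shows how the two differ, using the RT03S-FSH figures from the table above (27.4% for the GMM baseline and 18.5% for the best DNN). The helper name <code>relative_reduction</code> is hypothetical.<br />

```python
def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline word error rate that was removed."""
    return (baseline_wer - new_wer) / baseline_wer

# RT03S-FSH word error rates (in %) from the table above.
absolute = 27.4 - 18.5                     # reduction in percentage points
relative = relative_reduction(27.4, 18.5)  # reduction as a fraction of 27.4
```

Here the absolute reduction is 8.9 percentage points, which is roughly a third of the baseline error when expressed relatively.<br />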
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights that are below a certain threshold to zero. To further improve the accuracy, the DNN-HMM and GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using the DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
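The sparsification step mentioned above can be sketched as simple magnitude thresholding. This is an assumption-laden illustration: it interprets "less than a threshold" as a comparison on the absolute value of each weight, and the function names, the toy weight matrix and the threshold of 0.05 are hypothetical rather than taken from the paper.<br />

```python
def sparsify(weights, threshold):
    """Set every weight whose magnitude is below `threshold` to zero."""
    return [[w if abs(w) >= threshold else 0.0 for w in row]
            for row in weights]

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total

# Toy 2 x 3 weight matrix (hypothetical values).
W = [[0.80, -0.02, 0.01],
     [-0.50, 0.03, 0.00]]
W_sparse = sparsify(W, threshold=0.05)
zero_fraction = sparsity(W_sparse)
```

In practice the surviving weights would typically be retrained after pruning; that step is not shown here.<br />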
== YouTube Speech Recognition Task ==<br />
The goal of this task is to transcribe YouTube data. This type of data does not have a strong language model to constrain the interpretation of the speech information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature-space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only 4 hidden layers were used, to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. The DNN-HMM and GMM-HMM models were also combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced it by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
A GMM-HMM baseline system was used to align 50 hrs of speech from the 1996 and 1997 English Broadcast News Speech Corpora. SAT and DT features were used to train the DNN. The network consists of six hidden layers of 1024 units each. The number of HMM states is 2220 triphone states. A window of 9 frames was used to classify the middle frame. Fine-tuning was done after training to optimize the mutual information. The DNN-HMM system achieved a word error rate of 17.5% compared to 18.8% for the best GMM-HMM system <ref name=broadcastDNN><br />
T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Improvements in using<br />
deep belief networks for large vocabulary continuous speech recognition,”<br />
Speech and Language Algorithm Group, IBM, Yorktown Heights, NY, Tech.<br />
Rep. UTML TR 2010-003, Feb. 2011.<br />
</ref>.<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
The following table compares the performance of DNN-HMMs and GMM-HMMs; the DNN-HMM achieves the lower error rate in every comparison.<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870h)<br />
|-<br />
| YOUTUBE<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
<br />
= Alternative Pretraining Methods for DNNs =<br />
Pretraining DNNs was reported to improve the results on TIMIT and on large data set tasks. The pretraining was done generatively using a stack of RBMs. An alternative is discriminative pretraining, which starts from a shallow network whose weights are trained discriminatively. Another hidden layer is then added between the last hidden layer and the softmax layer, the weights of the newly added layer are again learned discriminatively, and so on. Finally, backpropagation fine-tuning is applied to the whole network. This way of training was reported to achieve the same results as generative pretraining <ref name=discrDNN><br />
F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent<br />
deep neural networks for conversational speech transcription,” in Proc.<br />
IEEE ASRU, 2011, pp. 24–29.<br />
</ref>.<br />
<br />
= Conclusions and Discussions =<br />
GMMs have been widely used for acoustic modelling; they are easy to train and quite flexible. Since 2009, DNNs have been proposed as a replacement for GMMs, and they have proven superior to GMMs in many speech recognition tasks. Pretraining DNNs is essential when the amount of data is small; it reduces overfitting and the convergence time. Fine-tuning the network to optimize the mutual information can improve the results. The authors think that there is still much that can be done in pretraining, fine-tuning, and using different types of hidden units to further increase the performance of DNNs.<br />
<br />
This paper summarizes recent research by three research groups in the area of speech recognition. They have shown that DNNs are superior to GMMs in both small and large data set speech recognition tasks. The authors claim that the reason for this superiority is that speech data lie on a nonlinear manifold; I am not sure whether there is any scientific or empirical proof for this claim.<br />
This paper is an excellent source for anyone interested in deep learning for speech recognition. The authors assume that readers are familiar with the HMM framework for speech recognition; otherwise the paper will be difficult to follow. Also, the DNN area is moving fast, and the information in the paper is no longer up to date.<br />
<br />
= References =<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f15Stat946PaperSignUp&diff=25977f15Stat946PaperSignUp2015-11-09T17:31:58Z<p>Arashwan: /* Set B */</p>
<hr />
<div> <br />
=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=<br />
<br />
= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=<br />
<br />
Use the following notations:<br />
<br />
S: You have written a summary on the paper<br />
<br />
T: You had technical contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
E: You had editorial contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]<br />
<br />
<br />
=Set A=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Oct 16 || Pascal Poupart || || Guest Lecturer||||<br />
|-<br />
|Oct 16 ||Pascal Poupart || ||Guest Lecturer ||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]<br />
|-<br />
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]<br />
|-<br />
|Makeup Class (TBA) || Peter Blouw|| ||Memory Networks.|| [http://arxiv.org/abs/1410.3916 Paper]|| [[Memory Networks|Summary]]<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]<br />
|-<br />
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]<br />
|-<br />
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]<br />
|-<br />
|Nov 13 || Tim Tse || || From Machine Learning to Machine Reasoning ||[http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]<br />
|-<br />
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]<br />
|-<br />
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]<br />
|-<br />
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]||<br />
|-<br />
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]<br />
|-<br />
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | Summary]]<br />
|-<br />
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/abs/10.1021/ci500747n.pdf Paper]||<br />
|-<br />
|Nov 27 || Derek Latremouille || ||The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]<br />
|-<br />
|Nov 27 ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models||||<br />
|-<br />
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]<br />
|-<br />
|Dec 4 || Fatemeh Karimi || ||MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||<br />
|-<br />
|Dec 4 || Jan Gosmann || || A fast learning algorithm for deep belief nets || [http://www.mitpressjournals.org/doi/pdf/10.1162/neco.2006.18.7.1527 Paper] || [[A fast learning algorithm for deep belief nets | Summary]]<br />
|-<br />
|Dec 4 || Dylan Drover || || Towards AI-complete question answering: a set of prerequisite toy tasks || [http://arxiv.org/pdf/1502.05698.pdf Paper] ||<br />
|-<br />
|}<br />
|}<br />
<br />
=Set B=<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Anthony Caterini ||15 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]||<br />
|-<br />
|Jan Gosmann || || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]<br />
|-<br />
|Brent Komer || || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] ||<br />
|-<br />
|Sean Aubin || || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]<br />
|-<br />
|Peter Blouw|| || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] ||<br />
|-<br />
|Tim Tse|| || Question answering with subgraph embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] ||<br />
|-<br />
|Rui Qiao|| || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]<br />
|-<br />
|Fatemeh Karimi|| 23 || Very Deep Convolutional Networks for Large-Scale Image Recognition || [http://arxiv.org/pdf/1409.1556.pdf Paper] || [[Very Deep Convoloutional Networks for Large-Scale Image Recognition|Summary]]<br />
|-<br />
|Amirreza Lashkari|| 43 || Distributed Representations of Words and Phrases and their Compositionality || [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Paper] || [[Distributed Representations of Words and Phrases and their Compositionality|Summary]]<br />
|-<br />
|Xinran Liu|| || Joint training of a convolutional network and a graphical model for human pose estimation || [http://papers.nips.cc/paper/5573-joint-training-of-a-convolutional-network-and-a-graphical-model-for-human-pose-estimation.pdf Paper] || [[Joint training of a convolutional network and a graphical model for human pose estimation|Summary]]<br />
|-<br />
|Chris Choi|| || Learning Long-Range Vision for Autonomous Off-Road Driving || [http://yann.lecun.com/exdb/publis/pdf/hadsell-jfr-09.pdf Paper] || [[Learning Long-Range Vision for Autonomous Off-Road Driving|Summary]]<br />
|-<br />
|Luyao Ruan|| || Deep Learning of the tissue-regulated splicing code || [http://bioinformatics.oxfordjournals.org/content/30/12/i121.full.pdf+html Paper] || [[Deep Learning of the tissue-regulated splicing code| Summary]]<br />
|-<br />
|Abdullah Rashwan|| || Deep Convolutional Neural Networks For LVCSR || [http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf paper] || [[Deep Convolutional Neural Networks For LVCSR| Summary]]</div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f15Stat946PaperSignUp&diff=25976f15Stat946PaperSignUp2015-11-09T17:31:32Z<p>Arashwan: /* Set B */</p>
<hr />
<div> <br />
=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=<br />
<br />
= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=<br />
<br />
Use the following notations:<br />
<br />
S: You have written a summary on the paper<br />
<br />
T: You had technical contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
E: You had editorial contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]<br />
<br />
<br />
=Set A=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Oct 16 || Pascal Poupart || || Guest Lecturer||||<br />
|-<br />
|Oct 16 ||Pascal Poupart || ||Guest Lecturer ||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]<br />
|-<br />
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]<br />
|-<br />
|Makeup Class (TBA) || Peter Blouw|| ||Memory Networks.|| [http://arxiv.org/abs/1410.3916 Paper]|| [[Memory Networks|Summary]]<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]<br />
|-<br />
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]<br />
|-<br />
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]<br />
|-<br />
|Nov 13 || Tim Tse || || From Machine Learning to Machine Reasoning ||[http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]<br />
|-<br />
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]<br />
|-<br />
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]<br />
|-<br />
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]||<br />
|-<br />
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]<br />
|-<br />
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | Summary]]<br />
|-<br />
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/abs/10.1021/ci500747n.pdf Paper]||<br />
|-<br />
|Nov 27 || Derek Latremouille || ||The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]<br />
|-<br />
|Nov 27 ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models||||<br />
|-<br />
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]<br />
|-<br />
|Dec 4 || Fatemeh Karimi || ||MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||<br />
|-<br />
|Dec 4 || Jan Gosmann || || A fast learning algorithm for deep belief nets || [http://www.mitpressjournals.org/doi/pdf/10.1162/neco.2006.18.7.1527 Paper] || [[A fast learning algorithm for deep belief nets | Summary]]<br />
|-<br />
|Dec 4 || Dylan Drover || || Towards AI-complete question answering: a set of prerequisite toy tasks || [http://arxiv.org/pdf/1502.05698.pdf Paper] ||<br />
|-<br />
|}<br />
|}<br />
<br />
=Set B=<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Anthony Caterini ||15 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]||<br />
|-<br />
|Jan Gosmann || || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]<br />
|-<br />
|Brent Komer || || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] ||<br />
|-<br />
|Sean Aubin || || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]<br />
|-<br />
|Peter Blouw|| || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] ||<br />
|-<br />
|Tim Tse|| || Question answering with subgraph embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] ||<br />
|-<br />
|Rui Qiao|| || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]<br />
|-<br />
|Fatemeh Karimi|| 23 || Very Deep Convolutional Networks for Large-Scale Image Recognition || [http://arxiv.org/pdf/1409.1556.pdf Paper] || [[Very Deep Convoloutional Networks for Large-Scale Image Recognition|Summary]]<br />
|-<br />
|Amirreza Lashkari|| 43 || Distributed Representations of Words and Phrases and their Compositionality || [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Paper] || [[Distributed Representations of Words and Phrases and their Compositionality|Summary]]<br />
|-<br />
|Xinran Liu|| || Joint training of a convolutional network and a graphical model for human pose estimation || [http://papers.nips.cc/paper/5573-joint-training-of-a-convolutional-network-and-a-graphical-model-for-human-pose-estimation.pdf Paper] || [[Joint training of a convolutional network and a graphical model for human pose estimation|Summary]]<br />
|-<br />
|Chris Choi|| || Learning Long-Range Vision for Autonomous Off-Road Driving || [http://yann.lecun.com/exdb/publis/pdf/hadsell-jfr-09.pdf Paper] || [[Learning Long-Range Vision for Autonomous Off-Road Driving|Summary]]<br />
|-<br />
|Luyao Ruan|| || Deep Learning of the tissue-regulated splicing code || [http://bioinformatics.oxfordjournals.org/content/30/12/i121.full.pdf+html Paper] || [[Deep Learning of the tissue-regulated splicing code| Summary]]<br />
|-<br />
|Abdullah Rashwan|| || Deep Convolutional Neural Networks For LVCSR || [http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf paper] || [[Deep Convolutional Neural Networks For LVCSR| Summary]]</div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=25907deep neural networks for acoustic modeling in speech recognition2015-11-06T18:26:05Z<p>Arashwan: /* Training Deep Neural Networks */</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model temporal variation and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits an acoustic observation. In these systems the speech signal is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from short overlapping windows of the raw signal. Although GMMs are flexible and fairly easy to train with the Expectation Maximization (EM) algorithm, they are inefficient at modelling data that lie close to a nonlinear manifold, which the paper argues is the case for speech. Deep Neural Networks (DNNs) do not suffer from this shortcoming, so they can learn much better models than GMMs. Advances in machine learning and computer hardware over the past few years have made training DNNs practical, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small- and large-vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
DNNs are feed-forward neural networks with multiple hidden layers. The last layer is a softmax layer that outputs the class probabilities. The weights are learned with the backpropagation algorithm; empirically, computing the gradient on small random mini-batches was found to be more efficient than using the whole training set. To avoid overfitting, early stopping is used: training stops when the accuracy on a validation set starts to decrease. Pretraining is essential when the amount of training data is small. Restricted Boltzmann Machines (RBMs) are used for pretraining, except for the first layer, which uses a Gaussian-Bernoulli RBM (GRBM) because the input is real-valued.<br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
An HMM requires the likelihoods of the observations, <math>p(AcousticInput|HMMstate)</math>, to run the forward-backward algorithm or to compute a Viterbi alignment. A DNN outputs the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to scaled likelihoods by dividing them by <math>p(HMMstate)</math>, the relative frequency of each HMM state in the training data. The result differs from the true likelihood only by the factor <math>p(AcousticInput)</math>, which is the same for every state and therefore does not affect the alignment. This conversion matters most when the training labels are highly unbalanced.<br />
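<br />
The conversion described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than code from the paper: the function names <code>estimate_state_priors</code> and <code>posteriors_to_scaled_log_likelihoods</code>, the small constant <code>eps</code>, and the use of the log domain are all assumptions made for the example.<br />

```python
import numpy as np


def estimate_state_priors(alignment, num_states):
    """Estimate p(HMMstate) as the relative frequency of each state
    in a frame-level training alignment (one state index per frame)."""
    counts = np.bincount(alignment, minlength=num_states).astype(float)
    return counts / counts.sum()


def posteriors_to_scaled_log_likelihoods(posteriors, state_priors, eps=1e-10):
    """Convert DNN posteriors p(state | frame) to scaled log-likelihoods.

    By Bayes' rule, p(frame | state) = p(state | frame) * p(frame) / p(state).
    p(frame) is identical for every state, so it is dropped and the result
    is the likelihood up to a per-frame constant.
    eps (an assumption of this sketch) guards against log(0).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    state_priors = np.asarray(state_priors, dtype=float)
    return np.log(posteriors + eps) - np.log(state_priors + eps)


# Toy usage: state 0 is three times as frequent as state 1 in training.
priors = estimate_state_priors(np.array([0, 0, 0, 1]), num_states=2)
scaled = posteriors_to_scaled_log_likelihoods([[0.6, 0.4]], priors)
```

In this toy case the posterior favours state 0, but after dividing by the priors the scaled likelihood favours state 1, which illustrates why the conversion matters when labels are unbalanced.<br />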
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. On this benchmark, DNN-HMM systems outperformed the classical GMM-HMM systems. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>, who reported a significant improvement in accuracy over the state-of-the-art GMM-HMM systems of the time. It was found that the structure of the DNN (i.e. the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus on tuning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they are approximately uncorrelated, which avoids the need for full-covariance GMMs. However, some acoustic information is lost when MFCCs are used. DNNs, on the other hand, can work with correlated features, which opened the door to using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. Using filter-bank features with DNNs was found to improve the accuracy by 1.7% <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy, or equivalently the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math> and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language model were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the conditional probability of the whole label sequence, <math>p(l_{1:T}|v_{1:T})</math>. This is done for the softmax layer only, with the parameters of the hidden layers <math>h</math> held fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
Here <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes the value one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise; <math>\gamma_{ij}</math> is the parameter associated with this transition feature; <math>\lambda</math> are the weights of the softmax layer; and <math>Z(h_{1:T})</math> is the normalizing constant obtained by summing the numerator over all possible label sequences. The parameters <math>\gamma</math> and <math>\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set<br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
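<br />
To make the sequence-level objective concrete, the following NumPy sketch evaluates <math>\log p(l_{1:T}|h_{1:T})</math> for the equation above. It is an illustration under stated assumptions, not the authors' implementation: the names <code>sequence_log_prob</code>, <code>lam</code> and <code>gamma</code> are hypothetical, the first frame is assumed to carry no transition term, and <math>Z(h_{1:T})</math> is computed with the standard forward recursion in the log domain.<br />

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def sequence_log_prob(labels, hidden, lam, gamma):
    """Log conditional probability of a label sequence.

    labels: (T,) integer labels l_1..l_T
    hidden: (T, D) top hidden-layer activations h_t (held fixed)
    lam:    (K, D) softmax weights, lam[k, d] = lambda_{k,d}
    gamma:  (K, K) transition parameters, gamma[i, j] = gamma_{ij}
    Assumption of this sketch: no transition term for the first frame.
    """
    emit = hidden @ lam.T  # emit[t, k] = sum_d lambda_{k,d} * h_{td}
    T = len(labels)

    # Numerator: score of the given label sequence.
    score = emit[0, labels[0]]
    for t in range(1, T):
        score += gamma[labels[t - 1], labels[t]] + emit[t, labels[t]]

    # Denominator: log Z via the forward recursion over all sequences.
    alpha = emit[0].copy()
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + gamma, axis=0) + emit[t]
    log_z = logsumexp(alpha, axis=0)

    return score - log_z
```

Gradient ascent on this quantity with respect to <code>lam</code> and <code>gamma</code> corresponds to the tuning step described above; the sketch only evaluates the objective and leaves the gradient to an automatic differentiation tool or a forward-backward pass.<br />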
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
Convolutional DNNs were introduced in 2009 and they were applied to various audio tasks including TIMIT dataset <ref><br />
H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for<br />
audio classification using convolutional deep belief networks,” in Advances in<br />
Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J.<br />
Lafferty, C. K. I. Williams, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2009,<br />
pp. 1096–1104.<br />
</ref>. In that work, convolution was applied along the temporal dimension in order to extract the same features at different times. Since temporal variation is already handled by the HMM, Abdel-Hamid et al. proposed applying convolution along the frequency dimension instead <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>. Weight sharing and max-pooling were applied only across nearby frequencies, because acoustic features at widely separated frequencies are very different. They achieved a phone error rate of 20% on TIMIT, the lowest reported at that time.<br />
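<br />
The idea of convolving and pooling along the frequency axis can be illustrated with a deliberately simplified sketch. It uses one filter shared across the whole frequency axis, no nonlinearity, and non-overlapping pooling, so it does not reproduce the restricted sharing scheme of the cited work; the function name <code>conv_pool_frequency</code> and all parameter values are hypothetical.<br />

```python
import numpy as np


def conv_pool_frequency(fbank, kernel, pool=2):
    """Slide one filter along the frequency axis of a single frame,
    then max-pool neighbouring outputs.

    fbank:  (F,) filter-bank energies of one frame
    kernel: (k,) filter weights shared across frequency positions
    pool:   number of neighbouring outputs merged by max-pooling
    """
    fbank = np.asarray(fbank, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    # 'valid' cross-correlation gives F - k + 1 local responses.
    conv = np.correlate(fbank, kernel, mode="valid")
    # Non-overlapping max-pooling; leftover outputs at the end are dropped.
    n = len(conv) // pool
    return conv[: n * pool].reshape(n, pool).max(axis=1)


pooled = conv_pool_frequency([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0], pool=2)
```

Pooling makes the output tolerant to small shifts of a spectral pattern along the frequency axis, which is the kind of variation that the HMM does not handle.<br />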
<br />
== DNNs and GMMs ==<br />
As the following table shows, monophone DNNs can outperform the best triphone GMM-HMM by 1.7%. Possible reasons for this success include the following:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large fraction of the data, while GMMs are a sum of experts, in which each parameter applies to only a small fraction of the data.<br />
# A GMM assumes that each data point is generated by a single component, which makes it inefficient at modelling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-bank outputs, and it also allows them to analyse a larger window of the signal at each time step.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phonetic error rate (PER) results on the TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application to recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs On Fbank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of DNN-HMM systems on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large-vocabulary tasks, and they outperformed GMM-HMM systems on every task.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
The Bing-Voice-Search data set contains 24 hrs of speech with many sources of acoustic variation, such as noise, music, side-speech, accents, sloppy pronunciation, interruptions, and differences between mobile phones. The DNN-HMM system was based on the DNN that worked well for TIMIT; it contained 5 hidden layers of 2048 units each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69.6% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the amount of training data to 48 hrs, giving a sentence accuracy of 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
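<br />
The 11-frame input window mentioned above amounts to stacking each frame with its five left and five right neighbours before feeding it to the DNN. A minimal sketch follows; the function name <code>stack_context</code> is hypothetical, and padding the utterance edges by repeating the first and last frame is an assumption of this example, not something stated in the paper summary.<br />

```python
import numpy as np


def stack_context(frames, left=5, right=5):
    """Build one DNN input vector per frame from a context window.

    frames: (T, F) array of per-frame acoustic features
    Returns a (T, (left + 1 + right) * F) array; row t is the
    concatenation of frames t-left .. t+right, and the DNN is trained
    to predict the HMM state of the middle frame t.
    Edge handling (repeating the first/last frame) is an assumption.
    """
    frames = np.asarray(frames)
    T = frames.shape[0]
    padded = np.concatenate(
        [
            np.repeat(frames[:1], left, axis=0),
            frames,
            np.repeat(frames[-1:], right, axis=0),
        ],
        axis=0,
    )
    width = left + right + 1
    return np.stack([padded[t : t + width].reshape(-1) for t in range(T)])


# Toy usage: 20 frames of 3 features each give 20 inputs of size 11 * 3.
features = np.arange(20 * 3).reshape(20, 3)
windows = stack_context(features)
```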
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN has 7 hidden layers of 2048 units each. A trigram language model, trained on a 2000 hr speech corpus, was used.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref>.<br />
The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training.<br />
{| class="wikitable"<br />
|+ Comparing five different DBN-DNN acoustic models with two strong, discriminatively trained GMM-HMM baseline systems. "40 Mix" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights that are below a certain threshold to zero. To further improve the accuracy, the DNN-HMM and the GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using the DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
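<br />
The sparsification step mentioned in this section can be sketched as a simple threshold on the weight matrix. The function name <code>sparsify</code> is hypothetical, the threshold value is illustrative, and comparing the absolute value of each weight against the threshold is an assumption of this sketch.<br />

```python
import numpy as np


def sparsify(weights, threshold):
    """Zero out small weights.

    Returns the sparsified weights and the boolean mask of weights that
    were kept. Thresholding on magnitude is an assumption of this sketch.
    """
    weights = np.asarray(weights, dtype=float)
    keep = np.abs(weights) >= threshold
    return np.where(keep, weights, 0.0), keep


# Toy usage with an illustrative threshold.
w = np.array([[0.5, -0.01], [0.02, -0.3]])
w_sparse, kept = sparsify(w, threshold=0.05)
fraction_kept = kept.mean()
```

In practice the mask would typically be reapplied after each further training update so that pruned weights stay at zero; that detail is a common convention and is not taken from the summary.<br />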
== YouTube Speech Recognition Task ==<br />
The goal of this task is to transcribe YouTube data. This type of data lacks a strong language model to constrain the interpretation of the acoustic information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision-tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature-space maximum likelihood linear regression (fMLLR)-transformed features were used. Because of the large number of HMM states, only 4 hidden layers were used, to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. The DNN-HMM and GMM-HMM models were also combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced it by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
A GMM-HMM baseline system was used to align 50 hrs of speech from the 1996 and 1997 English Broadcast News Speech Corpora. SAT and DT features were used to train the DNN. The network consists of six hidden layers of 1024 units each. The number of HMM states is 2220 triphone states. A window of 9 frames was used to classify the middle frame. Fine-tuning was done after training to optimize the mutual information. The DNN-HMM system achieved a word error rate of 17.5% compared to 18.8% for the best GMM-HMM system <ref name=broadcastDNN><br />
T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Improvements in using<br />
deep belief networks for large vocabulary continuous speech recognition,”<br />
Speech and Language Algorithm Group, IBM, Yorktown Heights, NY, Tech.<br />
Rep. UTML TR 2010-003, Feb. 2011.<br />
</ref>.<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
The following table compares the performance of DNN-HMMs with that of GMM-HMMs. In every case where both were trained on the same amount of data, the DNN-HMM has the lower error rate.<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870h)<br />
|-<br />
| YOUTUBE<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
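<br />
Because the tasks in the table have very different baseline error rates, relative reductions are easier to compare than absolute ones. The sketch below computes them from the figures reported in the table for the rows where both systems used the same training data; the function name <code>relative_reduction</code> and the dictionary layout are choices made for this example.<br />

```python
def relative_reduction(baseline, improved):
    """Relative error-rate reduction in percent."""
    return 100.0 * (baseline - improved) / baseline


# (GMM-HMM error rate, DNN-HMM error rate) as reported in the table above.
same_data_results = {
    "Switchboard (test set 1)": (27.4, 18.5),
    "Switchboard (test set 2)": (23.6, 16.1),
    "English Broadcast News": (18.8, 17.5),
    "Bing Voice Search (sentence error)": (36.2, 30.4),
    "YouTube": (52.3, 47.6),
}

for task, (gmm, dnn) in same_data_results.items():
    print(f"{task}: {relative_reduction(gmm, dnn):.1f}% relative reduction")
```

The same formula applied to the Google Voice Input figures, 16.0 for the GMM-HMM trained on more data and 12.3 for the DNN-HMM, gives about 23%, which matches the relative reduction quoted for that task.<br />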
<br />
= Alternative Pretraining Methods for DNNs =<br />
Pretraining DNNs was reported to improve the results on TIMIT and on the large data set tasks. The pretraining was done generatively using a stack of RBMs. An alternative is discriminative pretraining, which starts from a shallow network whose weights are trained discriminatively. Another hidden layer is then added between the last hidden layer and the softmax layer, the weights of the newly added layer are again learned discriminatively, and so on. Finally, backpropagation fine-tuning is applied to the whole network. This way of training was reported to achieve the same results as generative pretraining <ref name=discrDNN><br />
F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent<br />
deep neural networks for conversational speech transcription,” in Proc.<br />
IEEE ASRU, 2011, pp. 24–29.<br />
</ref>.<br />
<br />
= Conclusions and Discussions =<br />
GMMs have been widely used for acoustic modelling because they are easy to train and quite flexible. Since 2009, DNNs have been proposed as a replacement for GMMs, and they have proven superior to GMMs in many speech recognition tasks. Pretraining DNNs is essential when the amount of data is small, since it reduces both overfitting and convergence time. Fine-tuning the network to optimize the mutual information can improve the results. The authors believe that much remains to be done in pretraining, fine-tuning, and the use of different types of hidden units to further improve the performance of DNNs.<br />
<br />
This paper summarizes recent research by four research groups in the area of speech recognition. They have shown that DNNs are superior to GMMs in both small and large speech recognition tasks. The authors claim that the reason for this superiority is that speech data lie on a manifold, but I am not sure whether there is any scientific or empirical proof of this claim.<br />
This paper is an excellent source for anyone interested in deep learning for speech recognition. The authors assume that readers are familiar with the HMM framework for speech recognition; without that background the paper will be difficult to follow. Also, the DNN area is moving fast, and the information in the paper is no longer up to date.<br />
<br />
= References =<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=25861deep neural networks for acoustic modeling in speech recognition2015-11-05T21:36:37Z<p>Arashwan: /* Conclusions and Discussions */</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model temporal variation and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits an acoustic observation. In these systems the speech signal is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from short overlapping windows of the raw signal. Although GMMs are flexible and fairly easy to train with the Expectation Maximization (EM) algorithm, they are inefficient at modelling data that lie close to a nonlinear manifold, which the paper argues is the case for speech. Deep Neural Networks (DNNs) do not suffer from this shortcoming, so they can learn much better models than GMMs. Advances in machine learning and computer hardware over the past few years have made training DNNs practical, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small- and large-vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
An HMM requires the likelihoods of the observations, <math>p(AcousticInput|HMMstate)</math>, to run the forward-backward algorithm or to compute a Viterbi alignment. A DNN outputs the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to scaled likelihoods by dividing them by <math>p(HMMstate)</math>, the relative frequency of each HMM state in the training data. The result differs from the true likelihood only by the factor <math>p(AcousticInput)</math>, which is the same for every state and therefore does not affect the alignment. This conversion matters most when the training labels are highly unbalanced.<br />
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. On this benchmark, DNN-HMM systems outperformed the classical GMM-HMM systems. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>, who reported a significant improvement in accuracy over the state-of-the-art GMM-HMM systems of the time. It was found that the structure of the DNN (i.e. the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus on tuning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they are approximately uncorrelated, which avoids the need for full-covariance GMMs. However, some acoustic information is lost when MFCCs are used. DNNs, on the other hand, can work with correlated features, which opened the door to using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. Using filter-bank features with DNNs was found to improve the accuracy by 1.7% <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy, i.e. the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math> and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the conditional probability of the whole label sequence, <math>p(l_{1:T}|v_{1:T})</math>. This is done for the softmax layer only, keeping the parameters of the hidden layers <math>h</math> fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
Where <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes a value of one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with the transition feature, <math>h_{td}</math> is the <math>d</math>-th hidden activation at time <math>t</math>, and <math>\lambda</math> are the weights of the softmax layer. <math>\gamma</math> and <math>\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set<br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
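The sequence-level probability above can be evaluated with a forward recursion for the partition function <math>Z(h_{1:T})</math>. The sketch below is an illustration under stated assumptions, not the cited authors' code: the function names are hypothetical, the transition term is assumed to start at the second frame (there is no label before the first one), and the hidden activations are random toy values.<br />

```python
import numpy as np
from itertools import product

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def sequence_log_prob(labels, h, gamma, lam):
    """log p(l_{1:T} | h_{1:T}) for one label sequence.

    labels: list of T label indices
    h:      (T, D) activations of the last hidden layer (held fixed)
    gamma:  (L, L) transition parameters, gamma[i, j] for label i followed by j
    lam:    (L, D) softmax-layer weights
    """
    emit = h @ lam.T  # emit[t, j] = sum_d lam[j, d] * h[t, d]
    T = len(labels)
    # Unnormalised score of the given label sequence (numerator of the formula).
    score = emit[0, labels[0]]
    for t in range(1, T):
        score += gamma[labels[t - 1], labels[t]] + emit[t, labels[t]]
    # Forward recursion: alpha[j] = log-sum of scores of all prefixes ending in label j.
    alpha = emit[0]
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + gamma, axis=0) + emit[t]
    log_z = logsumexp(alpha, axis=0)
    return score - log_z

# Toy sizes: 4 frames, 3 labels, 5 hidden units (all hypothetical).
rng = np.random.default_rng(0)
T, L, D = 4, 3, 5
h = rng.normal(size=(T, D))
gamma = rng.normal(size=(L, L))
lam = rng.normal(size=(L, D))
log_p = sequence_log_prob([0, 2, 1, 1], h, gamma, lam)

# Sanity check: the probabilities of all L**T label sequences should sum to one.
total = sum(np.exp(sequence_log_prob(list(s), h, gamma, lam))
            for s in product(range(L), repeat=T))
```

Fine-tuning <math>\gamma</math> and <math>\lambda</math> would then follow the gradient of this log probability with respect to both parameter sets; the gradient computation itself is omitted here.<br />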
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
Convolutional DNNs were introduced in 2009 and were applied to various audio tasks, including the TIMIT dataset <ref><br />
H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for<br />
audio classification using convolutional deep belief networks,” in Advances in<br />
Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J.<br />
Lafferty, C. K. I. Williams, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2009,<br />
pp. 1096–1104.<br />
</ref>. In this work, convolutional DNNs were applied along the temporal dimension in order to extract the same features at different times. Since temporal variations are already handled by the HMM, Abdel-Hamid et al. proposed applying convolutional DNNs along the frequency dimension instead <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>. Weight sharing and max-pooling were used only across nearby frequencies, because acoustic features at widely separated frequencies are very different. They achieved a phone error rate of 20% on the TIMIT dataset, the lowest reported at that time.<br />
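Convolution and max-pooling along the frequency axis can be sketched as follows. This is a generic illustration and not the architecture of the cited paper: the function name, the sigmoid nonlinearity, the filter width, and the pool size are assumptions chosen for the example, and the same filters are shared across all frequency positions for simplicity.<br />

```python
import numpy as np

def freq_conv_maxpool(frame, filters, bias, pool_size):
    """Convolve one frame along frequency, then max-pool neighbouring positions.

    frame:   (n_bands,) filter-bank energies of a single frame
    filters: (n_filters, width) weights shared across frequency positions
    bias:    (n_filters,)
    Returns an array of shape (n_pools, n_filters).
    """
    n_bands = frame.shape[0]
    n_filters, width = filters.shape
    n_pos = n_bands - width + 1
    # windows[p] holds the bands p .. p + width - 1 seen by the filters at position p.
    windows = np.stack([frame[p:p + width] for p in range(n_pos)])
    conv = 1.0 / (1.0 + np.exp(-(windows @ filters.T + bias)))  # (n_pos, n_filters)
    # Non-overlapping max-pooling over groups of adjacent frequency positions.
    n_pools = n_pos // pool_size
    grouped = conv[:n_pools * pool_size].reshape(n_pools, pool_size, n_filters)
    return grouped.max(axis=1)

rng = np.random.default_rng(0)
frame = rng.normal(size=40)                # 40 filter-bank bands (toy data)
filters = rng.normal(size=(8, 8)) * 0.1    # 8 filters, each spanning 8 bands
bias = np.zeros(8)
pooled = freq_conv_maxpool(frame, filters, bias, pool_size=3)
```

Because each pooled unit takes the maximum over a few adjacent positions, a small shift of a spectral pattern along the frequency axis tends to leave the pooled output unchanged.<br />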
<br />
== DNNs and GMMs ==<br />
As the following table shows, monophone DNNs can outperform the best triphone GMM-HMM by 1.7% absolute. Possible reasons for this success include the following:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, for which each parameter applies to only a small amount of the data.<br />
# A GMM assumes that each datapoint is generated from a single component, which makes it inefficient at modeling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and also to analyse a larger window of the signal at each timestep.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phonetic error rate (PER) results on the TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application to recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs On Fbank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of DNN-HMM systems on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large tasks, and they outperformed GMM-HMM systems on every task.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
Bing-Voice-Search is a data set of 24 hrs of speech with different sources of acoustic variation such as noise, music, side-speech, accents, sloppy pronunciations, interruptions, and mobile phone differences. The DNN-HMM system was based on the DNN that worked well for TIMIT; it contained 5 hidden layers of 2048 units each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69.6% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the data size to 48 hrs, achieving a sentence accuracy of 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
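The 11-frame input window can be built by stacking each frame with its five left and five right neighbours. The following is a minimal sketch: the function name <code>stack_context</code> is hypothetical, and padding the utterance edges by repeating the first and last frame is an assumption, since the text does not say how edges are handled.<br />

```python
import numpy as np

def stack_context(features, left=5, right=5):
    """Stack each frame with its neighbours into one DNN input vector.

    features: (T, D) array, one row of D features per frame.
    Returns:  (T, (left + right + 1) * D) array; row t is the window centred on frame t.
    Edge handling is an assumption: the first/last frame is repeated as padding.
    """
    T, D = features.shape
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    width = left + right + 1
    return np.stack([padded[t:t + width].reshape(-1) for t in range(T)])

# Toy utterance: 20 frames with 3 features each.
features = np.arange(20 * 3, dtype=float).reshape(20, 3)
windows = stack_context(features)  # 11-frame windows, so each row has 33 values
```

Each row of <code>windows</code> would be presented to the DNN, whose target is the HMM state of the middle frame of that window.<br />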
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech data, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN used contains 7 hidden layers of 2048 units each. A trigram language model, trained using a 2000 hr speech corpus, was used.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref><br />
. The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training.<br />
{| class="wikitable"<br />
|+ Comparing five different DBN-DNN acoustic models with two strong GMM-HMM baseline systems that are discriminatively trained. "40 MIX" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN used has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights that are below a certain threshold to zero. To further improve the accuracy, the DNN-HMM and the GMM-HMM models were combined using a segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
== Youtube Speech Recognition Task ==<br />
The goal for this task is to transcribe YouTube data. This type of data doesn't have a strong language model to constrain the interpretation of the acoustic information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only 4 hidden layers were used, to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. Also, the DNN-HMM and GMM-HMM models were combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced it by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
A GMM-HMM baseline system was used to align 50 hrs of speech from the 1996 and 1997 English Broadcast News Speech Corpora. SAT and DT features were used to train the DNN. The network consists of six hidden layers of 1024 units each. The number of HMM states is 2220 triphone states. A window of 9 frames was used to classify the middle frame. Fine-tuning was done after training to optimize the mutual information. The DNN-HMM system achieved a word error rate of 17.5% compared to 18.8% for the best GMM-HMM system <ref name=broadcastDNN><br />
T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Improvements in using<br />
deep belief networks for large vocabulary continuous speech recognition,”<br />
Speech and Language Algorithm Group, IBM, Yorktown Heights, NY, Tech.<br />
Rep. UTML TR 2010-003, Feb. 2011.<br />
</ref>.<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
The following table shows the performance of DNN-HMMs compared to GMM-HMMs. The DNN-HMM achieves a lower error rate than the GMM-HMM in every row, including the rows where the GMM-HMM was trained on more data.<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870h)<br />
|-<br />
| YOUTUBE<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
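The absolute and relative error reductions implied by the table can be computed directly. The snippet below is a small worked example; the error rates are copied from the table above, the helper name <code>relative_reduction</code> is hypothetical, and the Google Voice Input baseline is the GMM-HMM trained on more data because no same-data GMM-HMM figure is listed.<br />

```python
def relative_reduction(baseline, new):
    """Relative error reduction in percent when going from baseline to new."""
    return 100.0 * (baseline - new) / baseline

# (GMM-HMM error rate, DNN-HMM error rate) in %, taken from the summary table.
results = {
    "Switchboard (test set 1)": (27.4, 18.5),
    "Switchboard (test set 2)": (23.6, 16.1),
    "English Broadcast News": (18.8, 17.5),
    "Bing Voice Search (sentence error)": (36.2, 30.4),
    "Google Voice Input (GMM-HMM trained on more data)": (16.0, 12.3),
}

for task, (gmm, dnn) in results.items():
    print(f"{task}: {gmm - dnn:.1f} absolute, "
          f"{relative_reduction(gmm, dnn):.1f}% relative")
```

For Google Voice Input this gives (16.0 - 12.3) / 16.0, which matches the 23% relative reduction quoted in the text.<br />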
<br />
= Alternative Pretraining Methods for DNNs =<br />
Pretraining DNNs was reported to improve the results on TIMIT and on the large data set tasks. The pretraining was done generatively using a stack of RBMs. An alternative is discriminative pretraining. In discriminative pretraining, we start from a shallow network whose weights are trained discriminatively. After that, another hidden layer is added between the last hidden layer and the softmax layer, the weights of the newly added layer are again learned discriminatively, and so on. Finally, backpropagation fine-tuning is applied to the whole network. This way of training was reported to achieve the same results as generative pretraining <ref name=discrDNN><br />
F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent<br />
deep neural networks for conversational speech transcription,” in Proc.<br />
IEEE ASRU, 2011, pp. 24–29.<br />
</ref>.<br />
<br />
= Conclusions and Discussions =<br />
GMMs have been used widely for acoustic modelling because they are easy to train and quite flexible. Since 2009, DNNs have been proposed as a replacement for GMMs, and they have been shown to be superior to GMMs in many speech recognition tasks. DNN pretraining is essential when the amount of data is small, as it reduces overfitting and the convergence time. Fine-tuning the network to optimize the mutual information can improve the results. The authors think that there is still much that can be done in pretraining, fine-tuning, and using different types of hidden units to further increase the performance of DNNs.<br />
<br />
This paper summarizes the recent research that has been done by three research groups in the area of speech recognition. They have shown that DNNs are superior to GMMs in both small and large dataset speech recognition tasks. The authors claim that the reason for this superiority is that speech data lies on a manifold; I am not sure whether there is any scientific or empirical proof for this claim.<br />
This paper is an excellent source for anyone interested in deep learning for speech recognition. The authors assume that readers are familiar with the HMM framework for speech recognition; otherwise the paper will be difficult to follow. Also, the DNN area is moving fast, and the information in the paper is not up to date.<br />
<br />
= References =<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=25846deep neural networks for acoustic modeling in speech recognition2015-11-05T17:39:43Z<p>Arashwan: /* Alternative Pretraining Methods for DNNs */</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model the temporal variations and Gaussian mixture models (GMMs) to determine the likelihood of each state of each HMM given an acoustic observation. The speech signal in the classical systems is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw speech signal. Although GMMs are quite flexible and fairly easy to train using the Expectation Maximization (EM) algorithm, they are inefficient when modelling data that lie close to a nonlinear manifold, which is the case for speech data. Deep Neural Networks (DNNs) don't suffer from the same shortcoming, hence they can learn much better models than GMMs. Over the past few years, training DNNs has become possible thanks to advances in machine learning and computer hardware, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small and large vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
An HMM requires the likelihoods of the observations <math>p(AcousticInput|HMMstate)</math> to run the forward-backward algorithm or to compute a Viterbi alignment. DNNs output the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to a scaled version of the likelihood by dividing them by <math>p(HMMstate)</math>, where <math>p(HMMstate)</math> is the frequency of each HMM state in the training data. The conversion from posteriors to likelihoods is important when the training labels are highly unbalanced.<br />
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. On this corpus, DNN-HMM systems outperformed the classical GMM-HMM systems. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>, who reported a significant improvement in accuracy over the state-of-the-art systems at that time. It was found that the structure of the DNN (i.e. the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus more on tuning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they provide uncorrelated features, which avoids the need for full-covariance GMMs. However, some of the acoustic information is lost when using MFCCs. DNNs, on the other hand, can work with correlated features, which opened the door to using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. It was found that using filter-bank features with DNNs improved the accuracy by 1.7% <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy, i.e. the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math> and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the conditional probability of the whole label sequence, <math>p(l_{1:T}|v_{1:T})</math>. This is done for the softmax layer only, keeping the parameters of the hidden layers <math>h</math> fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
Where <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes a value of one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with the transition feature, <math>h_{td}</math> is the <math>d</math>-th hidden activation at time <math>t</math>, and <math>\lambda</math> are the weights of the softmax layer. <math>\gamma</math> and <math>\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set<br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
Convolutional DNNs were introduced in 2009 and were applied to various audio tasks, including the TIMIT dataset <ref><br />
H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for<br />
audio classification using convolutional deep belief networks,” in Advances in<br />
Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J.<br />
Lafferty, C. K. I. Williams, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2009,<br />
pp. 1096–1104.<br />
</ref>. In this work, convolutional DNNs were applied along the temporal dimension in order to extract the same features at different times. Since temporal variations are already handled by the HMM, Abdel-Hamid et al. proposed applying convolutional DNNs along the frequency dimension instead <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>. Weight sharing and max-pooling were used only across nearby frequencies, because acoustic features at widely separated frequencies are very different. They achieved a phone error rate of 20% on the TIMIT dataset, the lowest reported at that time.<br />
<br />
== DNNs and GMMs ==<br />
As the following table shows, monophone DNNs can outperform the best triphone GMM-HMM by 1.7% absolute. Possible reasons for this success include the following:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, for which each parameter applies to only a small amount of the data.<br />
# A GMM assumes that each datapoint is generated from a single component, which makes it inefficient at modeling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and also to analyse a larger window of the signal at each timestep.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phonetic error rate (PER) results on the TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application to recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs On Fbank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of DNN-HMM systems on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large tasks, and they outperformed GMM-HMM systems on every task.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
Bing-Voice-Search is a data set of 24 hrs of speech with different sources of acoustic variation such as noise, music, side-speech, accents, sloppy pronunciations, interruptions, and mobile phone differences. The DNN-HMM system was based on the DNN that worked well for TIMIT; it contained 5 hidden layers of 2048 units each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69.6% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the data size to 48 hrs, achieving a sentence accuracy of 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech data, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN used contains 7 hidden layers of 2048 units each. A trigram language model, trained using a 2000 hr speech corpus, was used.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref><br />
. The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training.<br />
{| class="wikitable"<br />
|+ Comparing five different DBN-DNN acoustic models with two strong GMM-HMM baseline systems that are discriminatively trained. "40 MIX" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN used has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights that are below a certain threshold to zero. To further improve the accuracy, the DNN-HMM and the GMM-HMM models were combined using a segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
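The sparsification step mentioned above can be illustrated with a simple threshold on a weight matrix. This is a sketch under assumptions: the comparison is made on the absolute value of each weight, and the function name, the toy matrix, and the threshold value are hypothetical rather than taken from the cited work.<br />

```python
import numpy as np

def sparsify(weights, threshold):
    """Zero out weights whose magnitude is below the threshold.

    Returns the sparsified weights and the boolean mask of weights that were kept.
    Using the absolute value is an assumption made for this sketch.
    """
    mask = np.abs(weights) >= threshold
    return np.where(mask, weights, 0.0), mask

# Toy 2 x 2 weight matrix with two small entries.
W = np.array([[0.5, -0.01],
              [0.003, -0.8]])
W_sparse, kept = sparsify(W, threshold=0.05)
fraction_kept = kept.mean()
```

The mask can be reused during any further training so that the removed weights stay at zero.<br />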
== Youtube Speech Recognition Task ==<br />
The goal for this task is to transcribe YouTube data. This type of data doesn't have a strong language model to constrain the interpretation of the acoustic information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only 4 hidden layers were used, to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. Also, the DNN-HMM and GMM-HMM models were combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced it by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
A GMM-HMM baseline system was used to align 50 hrs of speech from the 1996 and 1997 English Broadcast News Speech Corpora. SAT and DT features were used to train the DNN. The network consists of six hidden layers of 1024 units each. The number of HMM states is 2220 triphone states. A window of 9 frames was used to classify the middle frame. Fine-tuning was done after training to optimize the mutual information. The DNN-HMM system achieved a word error rate of 17.5% compared to 18.8% for the best GMM-HMM system <ref name=broadcastDNN><br />
T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Improvements in using<br />
deep belief networks for large vocabulary continuous speech recognition,”<br />
Speech and Language Algorithm Group, IBM, Yorktown Heights, NY, Tech.<br />
Rep. UTML TR 2010-003, Feb. 2011.<br />
</ref>.<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
The following table compares the performance of DNN-HMMs and GMM-HMMs. On every task where both were trained on the same data, the DNN-HMM achieves the lower error rate.<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870h)<br />
|-<br />
| YOUTUBE<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
<br />
= Alternative Pretraining Methods for DNNs =<br />
Pretraining DNNs was reported to improve the results on TIMIT and on the large data set tasks. The pretraining was done generatively using a stack of RBMs. An alternative is discriminative pretraining. In discriminative pretraining, we start from a shallow network whose weights are trained discriminatively. After that, another hidden layer is added between the last hidden layer and the softmax layer, and the weights of the newly added layer are again learned discriminatively, and so on. Finally, backpropagation fine-tuning is applied to the whole network. This way of training was reported to achieve the same results as generative pretraining <ref name=discrDNN><br />
F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent<br />
deep neural networks for conversational speech transcription,” in Proc.<br />
IEEE ASRU, 2011, pp. 24–29.<br />
</ref>.<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=25845deep neural networks for acoustic modeling in speech recognition2015-11-05T17:30:12Z<p>Arashwan: /* Convolutional DNNs for Phone Classification and Recognition */</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model the temporal variations and Gaussian mixture models (GMMs) to determine the likelihood of each state of each HMM given an acoustic observation. The speech signal in the classical systems is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw speech signal. Although GMMs are quite flexible and fairly easy to train using the Expectation Maximization (EM) algorithm, they are inefficient at modelling data that lie close to a nonlinear manifold, which is the case for speech data. Deep Neural Networks (DNNs) do not suffer from the same shortcoming, so they can learn much better models than GMMs. Over the past few years, training DNNs has become possible thanks to advances in machine learning and computer hardware, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small and large vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
An HMM requires the likelihoods of the observations <math>p(AcousticInput|HMMstate)</math> for running the forward-backward algorithm or for computing a Viterbi alignment. DNNs output the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to a scaled version of the likelihood by dividing them by <math>p(HMMstate)</math>, where <math>p(HMMstate)</math> is the frequency of each HMM state in the training data. The conversion from the posteriors to the likelihoods is important when the training labels are highly unbalanced.<br />
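The conversion can be sketched in a few lines. This is an illustrative example only: the function names, the toy alignment, and the probability floor are assumptions, and the computation is done in the log domain, which is a common choice rather than something stated in the text. By Bayes' rule, <math>p(AcousticInput|HMMstate)</math> equals the posterior divided by the prior up to the factor <math>p(AcousticInput)</math>, which is the same for every state.

```python
import math


def priors_from_alignment(state_labels, num_states):
    """Estimate p(state) as the relative frequency of each state in a frame-level alignment."""
    counts = [0] * num_states
    for s in state_labels:
        counts[s] += 1
    return [c / len(state_labels) for c in counts]


def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """log p(state | x) - log p(state), i.e. log p(x | state) up to a state-independent constant."""
    return [math.log(max(post, floor)) - math.log(max(prior, floor))
            for post, prior in zip(posteriors, priors)]


# Toy, highly unbalanced alignment: state 0 covers 6 of 10 frames (hypothetical data).
alignment = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
priors = priors_from_alignment(alignment, num_states=3)

# Hypothetical DNN output for one frame.
posteriors = [0.5, 0.3, 0.2]
scores = scaled_log_likelihoods(posteriors, priors)

best_by_posterior = max(range(3), key=lambda s: posteriors[s])
best_by_likelihood = max(range(3), key=lambda s: scores[s])
print(best_by_posterior, best_by_likelihood)
```

In this toy case the frequent state wins on the raw posterior, while a rarer state wins once the prior is divided out, which is why the conversion matters for unbalanced labels.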
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. On this corpus, DNN-HMM systems outperformed the classical GMM-HMM systems. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>, who reported a significant improvement in accuracy over the state-of-the-art GMM-HMM systems at that time. It was found that the structure of the DNN (i.e. the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus more on tuning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they provide uncorrelated features, which is important to avoid using full covariance GMMs. Some of the acoustic information is lost by using MFCCs. DNNs, on the other hand, can work with correlated features, which opened the door to using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. It was found that using filter-bank features with DNNs reduced the phone error rate by 1.7% absolute <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy, or equivalently the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math> and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the conditional probability <math>p(l_{1:T}|v_{1:T})</math> of the whole label sequence. This is done for the softmax layer only, keeping the parameters of the hidden layers <math>h</math> fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\sum_{i,j}\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
where <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes a value of one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with the transition feature, and <math>\lambda</math> are the weights of the softmax layer. <math>\gamma</math> and <math>\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set<br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
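To make the sequence-level objective concrete, the following sketch evaluates <math>\log p(l_{1:T}|h_{1:T})</math> for a toy problem. It is only an illustration under stated assumptions: the transition term is applied from the second frame onward (no initial label <math>l_0</math> is modelled), the normaliser <math>Z(h_{1:T})</math> is computed with the standard forward recursion in the log domain, and all function names and numeric values are hypothetical.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def emission_scores(h, lam):
    """emit[t][k] = sum_d lam[k][d] * h[t][d], the softmax-layer score of label k at time t."""
    return [[sum(w * x for w, x in zip(lam[k], h_t)) for k in range(len(lam))]
            for h_t in h]


def sequence_score(labels, emit, gamma):
    """Unnormalised log score: per-frame label scores plus transition parameters."""
    score = sum(emit[t][l] for t, l in enumerate(labels))
    score += sum(gamma[a][b] for a, b in zip(labels, labels[1:]))
    return score


def log_partition(emit, gamma):
    """log Z(h_{1:T}), summing over all label sequences with the forward recursion."""
    num_labels = len(gamma)
    alpha = list(emit[0])
    for t in range(1, len(emit)):
        alpha = [emit[t][j] + logsumexp([alpha[i] + gamma[i][j] for i in range(num_labels)])
                 for j in range(num_labels)]
    return logsumexp(alpha)


def sequence_log_prob(labels, h, lam, gamma):
    emit = emission_scores(h, lam)
    return sequence_score(labels, emit, gamma) - log_partition(emit, gamma)


# Toy problem: T = 3 frames, D = 2 hidden units, 2 labels (hypothetical values).
h = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # top hidden layer activations h_{td}
lam = [[1.0, -1.0], [-1.0, 1.0]]           # softmax weights lambda_{k,d}
gamma = [[0.5, -0.5], [-0.5, 0.5]]         # transition parameters gamma_{ij}

for labels in product(range(2), repeat=3):
    print(labels, round(math.exp(sequence_log_prob(list(labels), h, lam, gamma)), 4))
```

Because the forward recursion sums over every possible label sequence, the probabilities printed by the loop add up to one. Gradient descent on <math>\gamma</math> and <math>\lambda</math> would then increase the log probability of the reference label sequence.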
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
Convolutional DNNs were introduced in 2009 and were applied to various audio tasks, including the TIMIT data set <ref><br />
H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for<br />
audio classification using convolutional deep belief networks,” in Advances in<br />
Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J.<br />
Lafferty, C. K. I. Williams, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2009,<br />
pp. 1096–1104.<br />
</ref>. In this work, convolutional DNNs were applied along the temporal dimension in order to extract the same features at different times. Since the temporal variations are already handled by the HMM, Abdel-Hamid et al. proposed to apply convolutional DNNs along the frequency dimension instead <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>. Weight-sharing and max-pooling were used only across nearby frequencies, because acoustic features at widely separated frequencies are very different. They achieved a phone error rate of 20.0% on the TIMIT data set, the lowest reported at that time.<br />
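As a rough illustration of pooling restricted to nearby frequencies, the sketch below takes the maximum over small groups of adjacent frequency bands. The pool size, the non-overlapping stride, the function name, and the activation values are assumptions made for the example and are not taken from the cited paper.

```python
def max_pool_frequency(band_activations, pool_size=3, stride=None):
    """Max over groups of adjacent frequency bands (non-overlapping by default)."""
    if stride is None:
        stride = pool_size
    last_start = len(band_activations) - pool_size
    return [max(band_activations[i:i + pool_size])
            for i in range(0, last_start + 1, stride)]


# Activations of one convolutional feature across six adjacent frequency bands.
acts = [0.1, 0.9, 0.2, 0.0, 0.3, 0.4]
# The same pattern with each peak moved to a neighbouring band inside its pool.
shifted = [0.1, 0.2, 0.9, 0.0, 0.4, 0.3]

print(max_pool_frequency(acts))
print(max_pool_frequency(shifted))
```

Both inputs give the same pooled output, which shows how pooling makes the representation tolerant to small shifts along the frequency axis while keeping distant frequency regions separate.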
<br />
== DNNs and GMMs ==<br />
As the following table shows, monophone DNNs can outperform the best triphone GMM-HMM by 1.7%. Possible reasons for this success include the following:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, for which each parameter applies to only a small amount of the data.<br />
# A GMM assumes that each datapoint is generated from a single component, which makes it inefficient at modeling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and it also allows them to analyse a larger window of the signal at each time step.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phonetic error rate (PER) results on TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application to recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs On Fbank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of DNN-HMM systems on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large tasks, and they outperformed GMM-HMM systems on every task.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
Bing-Voice-Search is a 24 hr speech data set with different sources of acoustic variation such as noise, music, side-speech, accents, sloppy pronunciations, interruptions, and mobile phone differences. The DNN-HMM system was based on the DNN that worked well for TIMIT. It contained 5 hidden layers of size 2048 each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the data size to 48 hrs, achieving a sentence accuracy of 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
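The 11-frame input window can be sketched as follows. The example assumes a symmetric window of five frames on each side of the frame being classified and pads the utterance edges by repeating the first or last frame; the padding rule, the function name, and the toy features are assumptions, not details from the cited work.

```python
def stack_context(frames, left=5, right=5):
    """Build one DNN input per frame by concatenating the frame with its neighbours.

    Frames beyond the start or end of the utterance are replaced by the first or
    last frame (an assumed padding rule).
    """
    num_frames = len(frames)
    stacked = []
    for t in range(num_frames):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), num_frames - 1)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Toy utterance: 20 frames with 2 features each (real systems use e.g. 40 filter-bank features).
frames = [[float(t), float(t) + 0.5] for t in range(20)]
inputs = stack_context(frames)

print(len(inputs), len(inputs[0]))  # one input per frame, 11 * 2 values each
```

Each input vector is labelled with the HMM state of its middle frame, so the network sees acoustic context without changing the frame-level targets.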
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech data, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN used contains 7 hidden layers of size 2048 each. A trigram language model was used; it was trained using a 2000 hr speech corpus.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref><br />
. The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training as the DNN system.<br />
{| class="wikitable"<br />
|+ Comparing five different DBN-DNN acoustic models with two strong GMM-HMM baseline systems that are discriminatively trained. "40 Mix" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN has four hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights whose magnitude is below a certain threshold to zero. To further improve the accuracy, the DNN-HMM and the GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using the DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
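The relation between the relative and absolute figures quoted above can be checked with a short calculation. The helper names are hypothetical; the 16.0% baseline is the GMM-HMM figure listed for this task in the summary table of this page, and treating it as the baseline for the 23% figure is an assumption.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative word error rate reduction, as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


def baseline_from_relative(new_wer, rel_reduction):
    """Baseline word error rate implied by a new rate and a relative reduction."""
    return new_wer / (1.0 - rel_reduction)


# A 23% relative reduction that ends at 12.3% implies a baseline close to 16%.
implied_baseline = baseline_from_relative(12.3, 0.23)

# Relative gain of the DNN-HMM over the 16.0% GMM-HMM figure.
rel_dnn = relative_reduction(16.0, 12.3)

# Relative gain from combining both models (12.3% to 11.8%).
rel_combo = relative_reduction(12.3, 11.8)

print(round(implied_baseline, 2), round(rel_dnn, 3), round(rel_combo, 3))
```

The implied baseline agrees with the 16.0% entry in the summary table to within rounding, and the model combination adds roughly a further 4% relative reduction.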
== Youtube Speech Recognition Task ==<br />
The goal for this task is to transcribe YouTube data. This type of data does not have a strong language model to constrain the interpretation of the speech information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only four hidden layers were used to save computation resources for the large softmax layer. After training, sequence-level fine-tuning was performed. Also, both DNN-HMM and GMM-HMM models were combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced it by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
A GMM-HMM baseline system was used to align 50 hrs of speech from the 1996 and 1997 English Broadcast News Speech Corpora. SAT and DT features were used to train the DNN. The network consists of six hidden layers of 1024 units each. The number of HMM states is 2220 triphone states. A window of 9 frames was used to classify the middle frame. Fine-tuning was done after training to optimize the mutual information. The DNN-HMM system achieved a word error rate of 17.5% compared to 18.8% for the best GMM-HMM system <ref name=broadcastDNN><br />
T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Improvements in using<br />
deep belief networks for large vocabulary continuous speech recognition,”<br />
Speech and Language Algorithm Group, IBM, Yorktown Heights, NY, Tech.<br />
Rep. UTML TR 2010-003, Feb. 2011.<br />
</ref>.<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
The following table compares the performance of DNN-HMMs and GMM-HMMs. On every task where both were trained on the same data, the DNN-HMM achieves the lower error rate.<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870h)<br />
|-<br />
| YOUTUBE<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
<br />
= Alternative Pretraining Methods for DNNs =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=25844deep neural networks for acoustic modeling in speech recognition2015-11-05T17:27:11Z<p>Arashwan: /* Convolutional DNNs for Phone Classification and Recognition */</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model the temporal variations and Gaussian mixture models (GMMs) to determine the likelihood of each state of each HMM given an acoustic observation. The speech signal in the classical systems is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw speech signal. Although GMMs are quite flexible and fairly easy to train using the Expectation Maximization (EM) algorithm, they are inefficient at modelling data that lie close to a nonlinear manifold, which is the case for speech data. Deep Neural Networks (DNNs) do not suffer from the same shortcoming, so they can learn much better models than GMMs. Over the past few years, training DNNs has become possible thanks to advances in machine learning and computer hardware, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small and large vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
An HMM requires the likelihoods of the observations <math>p(AcousticInput|HMMstate)</math> for running the forward-backward algorithm or for computing a Viterbi alignment. DNNs output the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to a scaled version of the likelihood by dividing them by <math>p(HMMstate)</math>, where <math>p(HMMstate)</math> is the frequency of each HMM state in the training data. The conversion from the posteriors to the likelihoods is important when the training labels are highly unbalanced.<br />
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. On this corpus, DNN-HMM systems outperformed the classical GMM-HMM systems. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>, who reported a significant improvement in accuracy over the state-of-the-art GMM-HMM systems at that time. It was found that the structure of the DNN (i.e. the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus more on tuning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they provide uncorrelated features, which is important to avoid using full covariance GMMs. Some of the acoustic information is lost by using MFCCs. DNNs, on the other hand, can work with correlated features, which opened the door to using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. It was found that using filter-bank features with DNNs reduced the phone error rate by 1.7% absolute <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy, or equivalently the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math> and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the conditional probability <math>p(l_{1:T}|v_{1:T})</math> of the whole label sequence. This is done for the softmax layer only, keeping the parameters of the hidden layers <math>h</math> fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\sum_{i,j}\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
where <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes a value of one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with the transition feature, and <math>\lambda</math> are the weights of the softmax layer. <math>\gamma</math> and <math>\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set<br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
Convolutional DNNs were introduced in 2009 and were applied to various audio tasks, including the TIMIT data set <ref><br />
H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for<br />
audio classification using convolutional deep belief networks,” in Advances in<br />
Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J.<br />
Lafferty, C. K. I. Williams, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2009,<br />
pp. 1096–1104.<br />
</ref>. In this work, convolutional DNNs were applied along the temporal dimension in order to extract the same features at different times. Since the temporal variations are already handled by the HMM, Abdel-Hamid et al. proposed to apply convolutional DNNs along the frequency dimension instead <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>. Weight-sharing and max-pooling were used only across nearby frequencies, because acoustic features at widely separated frequencies are very different. They achieved a phone error rate of 20.0% on the TIMIT data set, the lowest reported at that time.<br />
<br />
== DNNs and GMMs ==<br />
As the following table shows, monophone DNNs can outperform the best triphone GMM-HMM by 1.7%. Possible reasons for this success include the following:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, for which each parameter applies to only a small amount of the data.<br />
# A GMM assumes that each datapoint is generated from a single component, which makes it inefficient at modeling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and it also allows them to analyse a larger window of the signal at each time step.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phonetic error rate (PER) results on TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application to recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs On Fbank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of DNN-HMM systems on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large-vocabulary tasks, and they outperformed GMM-HMM systems on every task.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
Bing-Voice-Search is a data set of 24 hrs of speech with different sources of acoustic variation such as noise, music, side-speech, accents, sloppy pronunciations, interruptions, and mobile phone differences. The DNN-HMM system was based on the DNN that worked well for TIMIT. It contained 5 hidden layers of size 2048 each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69.6% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the data size to 48 hrs, achieving a sentence accuracy of 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech data, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN used contains 7 hidden layers of size 2048 each. A trigram language model was used; it was trained on a 2000 hr speech corpus.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref><br />
. The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training as the DNN system.<br />
{| class="wikitable"<br />
|+ Comparing five different DBN-DNN acoustic models with two strong GMM-HMM baseline systems that are discriminatively trained. "40 MIX" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN used has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights that are less than a certain threshold to zero. To further improve the accuracy, the DNN-HMM and the GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using the DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
== Youtube Speech Recognition Task ==<br />
The goal of this task is to transcribe YouTube data. This type of data doesn't have a strong language model to constrain the interpretation of the speech information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature-space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only 4 hidden layers were used to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. Also, the DNN-HMM and GMM-HMM models were combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced the word error rate by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
A GMM-HMM baseline system was used to align 50 hrs of speech from the 1996 and 1997 English Broadcast News Speech Corpora. SAT and DT features were used to train the DNN. The network consists of six hidden layers of 1024 units each. The number of HMM states is 2220 triphone states. A window of 9 frames was used to classify the middle frame. Fine-tuning was done after training to optimize the mutual information. The DNN-HMM system achieved a word error rate of 17.5% compared to 18.8% for the best GMM-HMM system <ref name=broadcastDNN><br />
T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Improvements in using<br />
deep belief networks for large vocabulary continuous speech recognition,”<br />
Speech and Language Algorithm Group, IBM, Yorktown Heights, NY, Tech.<br />
Rep. UTML TR 2010-003, Feb. 2011.<br />
</ref>.<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
The following table shows the performance of DNN-HMMs compared to GMM-HMMs. The DNN-HMMs achieve lower error rates than the GMM-HMMs trained on the same data in every task where both are reported.<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870h)<br />
|-<br />
| YOUTUBE<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
<br />
= Alternative Pretraining Methods for DNNs =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=25843deep neural networks for acoustic modeling in speech recognition2015-11-05T16:59:11Z<p>Arashwan: /* English Broadcast News Speech Recognition Task */</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model the temporal variations and Gaussian mixture models (GMMs) to determine the likelihood of each state of each HMM given an acoustic observation. The speech signal in the classical systems is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw speech signal. Although GMMs are quite flexible and fairly easy to train using the Expectation Maximization (EM) algorithm, they are inefficient at modelling data that lie close to a nonlinear manifold, which is the case for speech data. Deep Neural Networks (DNNs) don't suffer from the same shortcoming, hence they can learn much better models than GMMs. Over the past few years, training DNNs has become possible thanks to advances in machine learning and computer hardware, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small- and large-vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
An HMM requires the likelihoods of the observations <math>p(AcousticInput|HMMstate)</math> for running the forward-backward algorithm or for computing a Viterbi alignment. DNNs output the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to a scaled version of the likelihood by dividing them by <math>p(HMMstate)</math>, where <math>p(HMMstate)</math> is the frequency of each HMM state in the training data. The conversion from posteriors to likelihoods is important when the training labels are highly unbalanced.<br />
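<br />
The conversion described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name <code>scaled_log_likelihoods</code>, the toy posteriors, and the state counts are assumptions made for this example and do not come from the paper. It works in the log domain, where dividing by the prior becomes a subtraction.<br />

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_counts):
    """Convert DNN log posteriors log p(state | x) into scaled log-likelihoods.

    By Bayes' rule, log p(x | state) = log p(state | x) - log p(state) + log p(x).
    The term log p(x) is the same for every state, so it is dropped, which is
    why the result is only a *scaled* likelihood.
    """
    priors = state_counts / state_counts.sum()  # state frequencies in the training data
    return log_posteriors - np.log(priors)

# Hypothetical example: 2 frames, 3 HMM states with very unbalanced labels.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])
state_counts = np.array([800.0, 150.0, 50.0])
scaled = scaled_log_likelihoods(np.log(posteriors), state_counts)
```

In this toy example the first frame's most probable state under the posterior is the frequent state 0, but after dividing by the priors the rare state 2 receives the largest scaled likelihood, which illustrates why the conversion matters for unbalanced labels.<br />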
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. DNN-HMM systems outperformed the classical GMM-HMM systems on it. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>; they reported a significant improvement in accuracy over the state-of-the-art systems at that time. It was found that the structure of the DNN (i.e. the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus more on tuning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they provide uncorrelated features, which is important to avoid using full-covariance GMMs. Some of the acoustic information is lost when using MFCCs. DNNs, on the other hand, can work with correlated features, which opened the door to using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. It was found that using filter-bank features with DNNs improved the accuracy by 1.7% <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy or the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math> and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the conditional probability of the whole label sequence, <math>p(l_{1:T}|v_{1:T})</math>; this is done for the softmax layer only, with the parameters of the hidden layers <math>h</math> held fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
Here <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes a value of one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with the transition feature, <math>\lambda</math> are the weights of the softmax layer, and <math>Z(h_{1:T})</math> is the normalizing term. <math>\gamma</math> and <math>\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set<br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
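<br />
To make the sequence-level probability above more concrete, the following Python sketch evaluates <math>\log p(l_{1:T}|h_{1:T})</math> for a toy model, computing <math>Z(h_{1:T})</math> with a forward recursion in the log domain. It is a sketch under stated assumptions rather than the authors' implementation: the function name, the toy sizes, and the random parameters are hypothetical, and it assumes that no transition term is applied at the first frame (the formula leaves <math>l_0</math> undefined).<br />

```python
import numpy as np
from itertools import product

def sequence_log_prob(labels, h, gamma, lam):
    """Return log p(l_{1:T} | h_{1:T}) for a linear-chain model.

    labels: length-T sequence of state indices
    h:      (T, D) activations of the top hidden layer
    gamma:  (K, K) transition parameters gamma_ij
    lam:    (K, D) softmax weights lambda_{k,d}
    Assumption: no transition term is applied at the first frame.
    """
    labels = np.asarray(labels)
    emit = h @ lam.T  # emit[t, k] = sum_d lam[k, d] * h[t, d]

    # Unnormalized log score of the given label sequence.
    score = emit[np.arange(len(labels)), labels].sum()
    score += gamma[labels[:-1], labels[1:]].sum()

    # Forward recursion in the log domain to obtain log Z(h_{1:T}).
    alpha = emit[0]
    for t in range(1, len(labels)):
        alpha = emit[t] + np.logaddexp.reduce(alpha[:, None] + gamma, axis=0)
    log_z = np.logaddexp.reduce(alpha)
    return score - log_z

# Hypothetical toy sizes: K states, T frames, D hidden units.
K, T, D = 3, 4, 5
rng = np.random.default_rng(0)
h = rng.normal(size=(T, D))
gamma = rng.normal(size=(K, K))
lam = rng.normal(size=(K, D))

# Sanity check: the probabilities of all K**T label sequences should sum to one.
total = sum(np.exp(sequence_log_prob(seq, h, gamma, lam))
            for seq in product(range(K), repeat=T))
```

Gradient-based tuning of <math>\gamma</math> and <math>\lambda</math> would maximize this quantity over the training sequences while the hidden layers that produce <math>h</math> stay fixed.<br />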
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
== DNNs and GMMs ==<br />
As shown in the following table, monophone DNNs can outperform the best triphone GMM-HMM by 1.7% PER (20.0% vs. 21.7%). Possible reasons for this success include the following:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, in which each parameter applies to only a small amount of the data.<br />
# A GMM assumes that each data point is generated by a single component, which makes it inefficient at modeling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and it also allows them to analyse a larger window of the signal at each time step.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phonetic error rate (PER) results on the TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application to recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs On Fbank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of DNN-HMM systems on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large-vocabulary tasks, and they outperformed GMM-HMM systems on every task.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
Bing-Voice-Search is a data set of 24 hrs of speech with different sources of acoustic variation such as noise, music, side-speech, accents, sloppy pronunciations, interruptions, and mobile phone differences. The DNN-HMM system was based on the DNN that worked well for TIMIT. It contained 5 hidden layers of size 2048 each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69.6% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the data size to 48 hrs, achieving a sentence accuracy of 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
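<br />
The 11-frame input window mentioned above can be illustrated with a short Python sketch that stacks each frame with its five left and five right neighbours, so that the DNN sees the whole window when classifying the middle frame. The function name <code>stack_context</code>, the use of 40 features per frame, and the choice to pad the edges by repeating the first and last frame are assumptions made for this example; the summary does not say how utterance boundaries were handled.<br />

```python
import numpy as np

def stack_context(frames, left=5, right=5):
    """Stack each frame with its neighbours into one input vector.

    frames: (T, F) array of per-frame features.
    Returns a (T, (left + right + 1) * F) array; row t holds frames
    t-left .. t+right side by side, with frame t in the middle block.
    Assumption: edges are padded by repeating the first/last frame.
    """
    num_frames = frames.shape[0]
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    windows = [padded[i:i + num_frames] for i in range(left + right + 1)]
    return np.concatenate(windows, axis=1)

# Hypothetical utterance: 20 frames with 40 features each.
frames = np.arange(20 * 40, dtype=float).reshape(20, 40)
stacked = stack_context(frames)  # one 11-frame window per original frame
```

Each row of <code>stacked</code> would be one DNN input, and the training target for that row is the HMM state of the middle frame.<br />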
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech data, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN used contains 7 hidden layers of size 2048 each. A trigram language model was used; it was trained on a 2000 hr speech corpus.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref><br />
. The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training as the DNN system.<br />
{| class="wikitable"<br />
|+ Comparing five different DBN-DNN acoustic models with two strong GMM-HMM baseline systems that are discriminatively trained. "40 MIX" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN used has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights that are less than a certain threshold to zero. To further improve the accuracy, the DNN-HMM and the GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using the DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
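<br />
The sparsification step mentioned in this section can be sketched as simple magnitude thresholding. This is a minimal illustration, not the procedure from the cited work: the function name, the threshold value, and the interpretation of "less than a threshold" as referring to the absolute value of each weight are all assumptions.<br />

```python
import numpy as np

def sparsify(weights, threshold):
    """Zero out weights whose magnitude is below `threshold`.

    Returns the sparsified weights and the boolean mask of kept entries.
    Assumption: the comparison is made on absolute values.
    """
    mask = np.abs(weights) >= threshold
    return np.where(mask, weights, 0.0), mask

# Hypothetical 2 x 3 weight matrix and threshold.
weights = np.array([[0.50, -0.01, 0.03],
                    [0.02, -0.80, 0.20]])
sparse_weights, mask = sparsify(weights, threshold=0.05)
kept_fraction = mask.mean()  # fraction of weights that survive
```

Keeping the mask makes it possible to hold the removed weights at zero if training continues after sparsification.<br />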
== Youtube Speech Recognition Task ==<br />
The goal of this task is to transcribe YouTube data. This type of data doesn't have a strong language model to constrain the interpretation of the speech information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature-space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only 4 hidden layers were used to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. Also, the DNN-HMM and GMM-HMM models were combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced the word error rate by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
A GMM-HMM baseline system was used to align 50 hrs of speech from the 1996 and 1997 English Broadcast News Speech Corpora. SAT and DT features were used to train the DNN. The network consists of six hidden layers of 1024 units each. The number of HMM states is 2220 triphone states. A window of 9 frames was used to classify the middle frame. Fine-tuning was done after training to optimize the mutual information. The DNN-HMM system achieved a word error rate of 17.5% compared to 18.8% for the best GMM-HMM system <ref name=broadcastDNN><br />
T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Improvements in using<br />
deep belief networks for large vocabulary continuous speech recognition,”<br />
Speech and Language Algorithm Group, IBM, Yorktown Heights, NY, Tech.<br />
Rep. UTML TR 2010-003, Feb. 2011.<br />
</ref>.<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
The following table shows the performance of DNN-HMMs compared to GMM-HMMs. The DNN-HMMs achieve lower error rates than the GMM-HMMs trained on the same data in every task where both are reported.<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870h)<br />
|-<br />
| YOUTUBE<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
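<br />
The sections above mix "absolute" and "relative" error reductions, so a small Python sketch may help relate the two using error rates taken from the summary table. The function name and the dictionary layout are assumptions made for this example. For Google Voice Input the only GMM-HMM figure in the table is the one trained on more data, so that value is used as the baseline.<br />

```python
def reductions(dnn_error, gmm_error):
    """Return (absolute, relative) error reduction in percentage points and percent."""
    absolute = gmm_error - dnn_error
    relative = 100.0 * absolute / gmm_error
    return absolute, relative

# (DNN-HMM error, GMM-HMM error) pairs read off the summary table.
results = {
    "Switchboard (test set 1)": (18.5, 27.4),
    "Switchboard (test set 2)": (16.1, 23.6),
    "English Broadcast News": (17.5, 18.8),
    "Bing Voice Search (sentence error)": (30.4, 36.2),
    "Google Voice Input (GMM on more data)": (12.3, 16.0),
    "YouTube (1400 h)": (47.6, 52.3),
}

for task, (dnn, gmm) in results.items():
    absolute, relative = reductions(dnn, gmm)
    print(f"{task}: {absolute:.1f} absolute, {relative:.1f}% relative")
```

With these numbers, the Google Voice Input row gives roughly the 23% relative reduction quoted in its section, and the 1400-hour row gives the 4.7% absolute reduction quoted for the YouTube task.<br />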
<br />
= Alternative Pretraining Methods for DNNs =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=25842deep neural networks for acoustic modeling in speech recognition2015-11-05T16:29:31Z<p>Arashwan: /* Summary for the Main Results for DNN Acoustic Models on Large Data Sets */</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model the temporal variations and Gaussian mixture models (GMMs) to determine the likelihood of each state of each HMM given an acoustic observation. The speech signal in the classical systems is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw speech signal. Although GMMs are quite flexible and fairly easy to train using the Expectation Maximization (EM) algorithm, they are inefficient at modelling data that lie close to a nonlinear manifold, which is the case for speech data. Deep Neural Networks (DNNs) don't suffer from the same shortcoming, hence they can learn much better models than GMMs. Over the past few years, training DNNs has become possible thanks to advances in machine learning and computer hardware, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small- and large-vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
An HMM requires the likelihoods of the observations <math>p(AcousticInput|HMMstate)</math> for running the forward-backward algorithm or for computing a Viterbi alignment. DNNs output the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to a scaled version of the likelihood by dividing them by <math>p(HMMstate)</math>, where <math>p(HMMstate)</math> is the frequency of each HMM state in the training data. The conversion from posteriors to likelihoods is important when the training labels are highly unbalanced.<br />
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. DNN-HMM systems outperformed the classical GMM-HMM systems on it. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>; they reported a significant improvement in accuracy over the state-of-the-art systems at that time. It was found that the structure of the DNN (i.e. the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus more on tuning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they provide uncorrelated features, which is important to avoid using full-covariance GMMs. Some of the acoustic information is lost when using MFCCs. DNNs, on the other hand, can work with correlated features, which opened the door to using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. It was found that using filter-bank features with DNNs improved the accuracy by 1.7% <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy or the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math> and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the conditional probability of the whole label sequence, <math>p(l_{1:T}|v_{1:T})</math>; this is done for the softmax layer only, with the parameters of the hidden layers <math>h</math> held fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
Here <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes a value of one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with the transition feature, <math>\lambda</math> are the weights of the softmax layer, and <math>Z(h_{1:T})</math> is the normalizing term. <math>\gamma</math> and <math>\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set<br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
== DNNs and GMMs ==<br />
As shown in the following table, monophone DNNs can outperform the best triphone GMM-HMM by 1.7% PER (20.0% vs. 21.7%). Possible reasons for this success include the following:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, in which each parameter applies to only a small amount of the data.<br />
# A GMM assumes that each data point is generated by a single component, which makes it inefficient at modeling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and it also allows them to analyse a larger window of the signal at each time step.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phonetic error rate (PER) results on the TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application to recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs on FBank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of the DNN-HMM system on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large-vocabulary tasks and outperformed GMM-HMM systems on every task.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
Bing-Voice-Search is a data set of 24 hrs of speech with different sources of acoustic variation such as noise, music, side-speech, accents, sloppy pronunciations, interruptions, and mobile phone differences. The DNN-HMM system was based on the DNN that worked well for TIMIT. It contained 5 hidden layers of size 2048 each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69.6% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the data size to 48 hrs, achieving a sentence accuracy of 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech data, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN used contains 7 hidden layers of size 2048 each. A trigram language model was used; it was trained using a 2000 hrs speech corpus.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref><br />
. The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training as the DNN system.<br />
{| class="wikitable"<br />
|+ Comparing five different DBN-DNN acoustic models with two strong GMM-HMM baseline systems that are discriminatively trained. "40 MIX" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN used has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights that are less than a certain threshold to zero. To further improve the accuracy, the DNN-HMM and the GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using the DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
== Youtube Speech Recognition Task ==<br />
The goal of this task is to transcribe YouTube data. This type of data doesn't have a strong language model to constrain the interpretation of the speech information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature-space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only 4 hidden layers were used to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. Also, both DNN-HMM and GMM-HMM models were combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced the word error rate by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
The following table compares the performance of DNN-HMMs with that of GMM-HMMs. Wherever both systems are reported on the same training data, the DNN-HMM achieves the lower error rate.<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870h)<br />
|-<br />
| YOUTUBE<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
<br />
= Alternative Pretraining Methods for DNNs =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=25835deep neural networks for acoustic modeling in speech recognition2015-11-04T21:54:20Z<p>Arashwan: /* Summary for the Main Results for DNN Acoustic Models on Large Data Sets */</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model the temporal variations and Gaussian mixture models (GMMs) to determine the likelihood of each state of each HMM given an acoustic observation. The speech signal in the classical systems is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw speech signal. Although GMMs are quite flexible and are fairly easy to train using the Expectation Maximization (EM) algorithm, they are inefficient when modelling data that lie close to a nonlinear manifold, which is the case for speech data. Deep Neural Networks (DNNs) don't suffer from the same shortcoming, hence they can learn much better models than GMMs. Over the past few years, training DNNs has become possible thanks to advancements in machine learning and computer hardware, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small- and large-vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
The HMM requires the likelihoods of the observations <math>p(AcousticInput|HMMstate)</math> for running the forward-backward algorithm or for computing a Viterbi alignment. DNNs output the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to a scaled version of the likelihood by dividing them by <math>p(HMMstate)</math>, where <math>p(HMMstate)</math> is the frequency of each HMM state in the training data. The conversion from the posteriors to the likelihoods is important when the training labels are highly unbalanced.<br />
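<br />
The conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the three-state example, and the small probability floor are hypothetical and are not taken from the paper. The division is done in the log domain, which is how decoders usually combine scores.<br />

```python
import math

def state_priors(state_counts):
    """Estimate p(HMMstate) as the relative frequency of each state label."""
    total = sum(state_counts)
    return [count / total for count in state_counts]

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Return log p(state | input) - log p(state) for every state.

    By Bayes' rule this equals log p(input | state) up to an additive
    constant log p(input), which is the same for all states in a frame.
    The floor (an assumption of this sketch) avoids log(0).
    """
    return [math.log(max(post, floor)) - math.log(max(prior, floor))
            for post, prior in zip(posteriors, priors)]

# Hypothetical, highly unbalanced training labels for three HMM states.
counts = [900, 90, 10]
priors = state_priors(counts)

# Hypothetical DNN softmax output for one acoustic frame.
posteriors = [0.5, 0.3, 0.2]
scores = scaled_log_likelihoods(posteriors, priors)

best_by_posterior = max(range(len(posteriors)), key=lambda s: posteriors[s])
best_by_likelihood = max(range(len(scores)), key=lambda s: scores[s])
```

In this toy frame the most frequent state has the highest posterior, but the rarest state has the highest scaled likelihood, which shows why the conversion matters when the labels are unbalanced.<br />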
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. On this benchmark, DNN-HMM systems outperformed the classical GMM-HMM systems. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>, who reported a significant improvement in accuracy over the state-of-the-art systems at that time. It was found that the structure of the DNN (i.e. number of hidden layers, and number of hidden units per layer) has little effect on the accuracy, which made it possible to focus more on learning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they provide uncorrelated features, which is important to avoid using full-covariance GMMs. Some of the acoustic information is lost when using MFCCs. DNNs, on the other hand, can work with correlated features, which made room for using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. It was found that using filter-bank features with DNNs improved the accuracy by 1.7% <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy or the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math>, and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the conditional probability of the whole label sequence, <math>p(l_{1:T}|v_{1:T})</math>; this is done for the softmax layer only, keeping the parameters of the hidden layers <math>h</math> fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
Where <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes a value of one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with the transition feature, and <math>\lambda</math> are the weights of the softmax layer. <math>\gamma</math> and <math>\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set<br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
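<br />
The sequence-level probability above can be evaluated without enumerating every label sequence, because the partition function <math>Z(h_{1:T})</math> factorizes over time and can be computed with the forward recursion. The following Python sketch illustrates this for a tiny example. It is a sketch under assumptions, not the authors' implementation: the function names and all numbers are hypothetical, and the transition term is only applied from the second frame onward because no start label <math>l_0</math> is defined in the text.<br />

```python
import math
from itertools import product

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def frame_scores(lam, h):
    """emit[t][k] = sum_d lam[k][d] * h[t][d] (softmax-layer activation)."""
    return [[sum(w * x for w, x in zip(lam[k], h_t)) for k in range(len(lam))]
            for h_t in h]

def sequence_score(labels, gamma, emit):
    """Unnormalized log score of one label sequence (the exponent above)."""
    score = emit[0][labels[0]]
    for t in range(1, len(labels)):
        score += gamma[labels[t - 1]][labels[t]] + emit[t][labels[t]]
    return score

def log_partition(gamma, emit):
    """log Z(h_{1:T}) computed with the forward recursion."""
    num_labels = len(gamma)
    alpha = list(emit[0])
    for t in range(1, len(emit)):
        alpha = [emit[t][j] + logsumexp([alpha[i] + gamma[i][j]
                                         for i in range(num_labels)])
                 for j in range(num_labels)]
    return logsumexp(alpha)

def sequence_log_prob(labels, gamma, lam, h):
    """log p(l_{1:T} | h_{1:T}) for the given label sequence."""
    emit = frame_scores(lam, h)
    return sequence_score(labels, gamma, emit) - log_partition(gamma, emit)

# Hypothetical toy problem: 2 labels, 2 hidden units, 3 frames.
gamma = [[0.5, -0.5], [-0.5, 0.5]]          # transition parameters gamma_ij
lam = [[1.0, 0.0], [0.0, 1.0]]              # softmax weights lambda_{l,d}
h = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7]]    # top hidden layer activations h_td

log_p = sequence_log_prob([0, 0, 1], gamma, lam, h)

# Sanity check: probabilities of all 2**3 label sequences should sum to one.
total_prob = sum(math.exp(sequence_log_prob(list(seq), gamma, lam, h))
                 for seq in product(range(2), repeat=3))
```

Gradient descent on <math>\gamma</math> and <math>\lambda</math> would then maximize this log probability for the reference label sequences; the gradient computation itself is not shown here.<br />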
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
== DNNs and GMMs ==<br />
As the following table shows, monophone DNNs can outperform the best triphone GMM-HMM by 1.7% absolute. Possible reasons for this success include the following:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, for which each parameter applies to only a small amount of the data.<br />
# A GMM assumes that each data point is generated from a single component, which makes it inefficient at modeling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and it also allows them to analyse a larger window of the signal at each timestep.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phonetic error rate (PER) results on the TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application of recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs on FBank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of the DNN-HMM system on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large-vocabulary tasks and outperformed GMM-HMM systems on every task.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
Bing-Voice-Search is a data set of 24 hrs of speech with different sources of acoustic variation such as noise, music, side-speech, accents, sloppy pronunciations, interruptions, and mobile phone differences. The DNN-HMM system was based on the DNN that worked well for TIMIT. It contained 5 hidden layers of size 2048 each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69.6% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the data size to 48 hrs, achieving a sentence accuracy of 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
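<br />
The 11-frame input window mentioned above can be built by concatenating each frame with its five left and five right neighbours. The Python sketch below shows one way to do this. The function name is hypothetical, and repeating the first and last frame at the utterance edges is an assumption of this sketch; the text does not say how the edges were handled.<br />

```python
def stack_context(frames, left=5, right=5):
    """Concatenate each frame with its neighbours into one input vector.

    frames: list of per-frame feature vectors (lists of floats).
    Returns one vector of length (left + 1 + right) * feature_dim per frame.
    Edge frames are repeated where the window runs past the utterance
    (an assumption of this sketch).
    """
    num_frames = len(frames)
    stacked = []
    for t in range(num_frames):
        window = []
        for k in range(t - left, t + right + 1):
            idx = min(max(k, 0), num_frames - 1)  # clamp to valid range
            window.extend(frames[idx])
        stacked.append(window)
    return stacked

# Hypothetical utterance: 20 frames with 2 features each.
frames = [[float(t), float(t) + 0.5] for t in range(20)]
inputs = stack_context(frames)

# The DNN would classify the middle frame of each window, i.e. the
# features stored at positions [left * dim : (left + 1) * dim].
dim = len(frames[0])
middle_of_frame_10 = inputs[10][5 * dim:6 * dim]
```

With the default arguments every input vector covers 11 frames, so 40 filter-bank features per frame would give a 440-dimensional input.<br />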
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech data, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN used contains 7 hidden layers of size 2048 each. A trigram language model was used; it was trained using a 2000 hrs speech corpus.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref><br />
. The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training as the DNN system.<br />
{| class="wikitable"<br />
|+ Comparing five different DBN-DNN acoustic models with two strong GMM-HMM baseline systems that are discriminatively trained. "40 MIX" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN used has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights that are less than a certain threshold to zero. To further improve the accuracy, the DNN-HMM and the GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using the DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
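<br />
The sparsification step mentioned above amounts to zeroing small weights. A minimal Python sketch follows. It assumes that "less than a certain threshold" refers to the magnitude of a weight, which the text does not state explicitly; the function names, the weight matrix, and the threshold value are hypothetical.<br />

```python
def sparsify(weights, threshold):
    """Set every weight whose magnitude is below the threshold to zero.

    Comparing magnitudes (abs) is an assumption of this sketch.
    """
    return [[w if abs(w) >= threshold else 0.0 for w in row]
            for row in weights]

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total

# Hypothetical 2 x 3 weight matrix of one layer.
weights = [[0.8, -0.02, 0.3],
           [0.01, -0.6, 0.04]]

sparse_weights = sparsify(weights, threshold=0.05)
zero_fraction = sparsity(sparse_weights)
```

In practice the network is usually trained further after pruning so that the remaining weights can compensate; whether that was done here is not stated in the text.<br />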
== Youtube Speech Recognition Task ==<br />
The goal of this task is to transcribe YouTube data. This type of data doesn't have a strong language model to constrain the interpretation of the speech information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature-space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only 4 hidden layers were used to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. Also, both DNN-HMM and GMM-HMM models were combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced the word error rate by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
The following table compares the performance of DNN-HMMs with that of GMM-HMMs. Wherever both systems are reported on the same training data, the DNN-HMM achieves the lower error rate.<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870h)<br />
|-<br />
| YOUTUBE<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
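<br />
The text reports some gains as absolute and others as relative reductions, so it can help to compute both from the table. The Python sketch below does this for the rows where a DNN-HMM and a GMM-HMM trained on the same data are both listed. The error rates are copied from the table; the dictionary keys and the function name are only illustrative, and the 1400-hour row is labelled with the YouTube task described earlier, which uses that amount of training data.<br />

```python
def reductions(dnn_error, gmm_error):
    """Return (absolute, relative-in-percent) error reduction of DNN over GMM."""
    absolute = gmm_error - dnn_error
    relative = 100.0 * absolute / gmm_error
    return absolute, relative

# (DNN-HMM error, GMM-HMM error) in percent, same training data.
results = {
    "Switchboard (test set 1)": (18.5, 27.4),
    "Switchboard (test set 2)": (16.1, 23.6),
    "English Broadcast News": (17.5, 18.8),
    "Bing Voice Search (sentence error)": (30.4, 36.2),
    "YouTube (1400 h)": (47.6, 52.3),
}

summary = {task: reductions(dnn, gmm) for task, (dnn, gmm) in results.items()}

for task, (absolute, relative) in summary.items():
    print(f"{task}: {absolute:.1f} absolute, {relative:.1f}% relative")
```

The Google Voice Input row is left out because its GMM-HMM baseline was trained on more data; applying the same function to 12.3 versus 16.0 gives the roughly 23% relative reduction quoted in that section.<br />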
<br />
= Alternative Pretraining Methods for DNNs =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=25834deep neural networks for acoustic modeling in speech recognition2015-11-04T21:52:09Z<p>Arashwan: /* Summary for the Main Results for DNN Acoustic Models on Large Data Sets */</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model the temporal variations and Gaussian mixture models (GMMs) to determine the likelihood of each state of each HMM given an acoustic observation. The speech signal in the classical systems is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw speech signal. Although GMMs are quite flexible and are fairly easy to train using the Expectation Maximization (EM) algorithm, they are inefficient when modelling data that lie close to a nonlinear manifold, which is the case for speech data. Deep Neural Networks (DNNs) don't suffer from the same shortcoming, hence they can learn much better models than GMMs. Over the past few years, training DNNs has become possible thanks to advancements in machine learning and computer hardware, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small- and large-vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
The HMM requires the likelihoods of the observations <math>p(AcousticInput|HMMstate)</math> for running the forward-backward algorithm or for computing a Viterbi alignment. DNNs output the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to a scaled version of the likelihood by dividing them by <math>p(HMMstate)</math>, where <math>p(HMMstate)</math> is the frequency of each HMM state in the training data. The conversion from the posteriors to the likelihoods is important when the training labels are highly unbalanced.<br />
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. On this benchmark, DNN-HMM systems outperformed the classical GMM-HMM systems. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>, who reported a significant improvement in accuracy over the state-of-the-art systems at that time. It was found that the structure of the DNN (i.e. number of hidden layers, and number of hidden units per layer) has little effect on the accuracy, which made it possible to focus more on learning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they provide uncorrelated features, which is important to avoid using full-covariance GMMs. Some of the acoustic information is lost when using MFCCs. DNNs, on the other hand, can work with correlated features, which made room for using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. It was found that using filter-bank features with DNNs improved the accuracy by 1.7% <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy or the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math>, and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the conditional probability of the whole label sequence, <math>p(l_{1:T}|v_{1:T})</math>; this is done for the softmax layer only, keeping the parameters of the hidden layers <math>h</math> fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
Where <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes a value of one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with the transition feature, and <math>\lambda</math> are the weights of the softmax layer. <math>\gamma</math> and <math>\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set<br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
== DNNs and GMMs ==<br />
As the following table shows, monophone DNNs can outperform the best triphone GMM-HMM by 1.7% absolute. Possible reasons for this success include the following:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, for which each parameter applies to only a small amount of the data.<br />
# A GMM assumes that each data point is generated from a single component, which makes it inefficient at modeling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and it also allows them to analyse a larger window of the signal at each timestep.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phonetic error rate (PER) results on the TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application of recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs On FBank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of DNN-HMM systems on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large-vocabulary tasks and outperformed GMM-HMM systems on every task.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
The Bing-Voice-Search data set consists of 24 hrs of speech data with different sources of acoustic variation such as noise, music, side-speech, accents, sloppy pronunciations, interruptions, and mobile phone differences. The DNN-HMM system was based on the DNN that worked well for TIMIT. It contained 5 hidden layers of size 2048 each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69.6% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the data size to 48 hrs, achieving a sentence accuracy of 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
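The 11-frame input window can be sketched as follows. This is an illustrative sketch only: the function name, the 3-dimensional toy frames, and the choice to repeat the first and last frames at the utterance edges are assumptions, since the text does not say how the edges were handled.<br />

```python
def stack_context(frames, left=5, right=5):
    """Concatenate each frame with its neighbours (5 + 1 + 5 = 11 frames).

    frames: list of T feature vectors (lists of floats)
    Returns T vectors, each (left + 1 + right) times as long as one frame.
    Assumption: edge frames are repeated where the window runs off the utterance.
    """
    num_frames = len(frames)
    stacked = []
    for t in range(num_frames):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), num_frames - 1)  # clamp to valid range
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Hypothetical utterance: 20 frames of 3 features, each frame filled with its index.
frames = [[float(t)] * 3 for t in range(20)]
stacked = stack_context(frames)
```

Each stacked vector is the DNN input used to classify the middle frame of its window into an HMM state.<br />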
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech data, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN used contains 7 hidden layers of size 2048 each. A trigram language model was used; it was trained on a 2000 hr speech corpus.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref><br />
. The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training as the DNN system.<br />
{| class="wikitable"<br />
|+ Comparing five different DBN-DNN acoustic models with two strong GMM-HMM baseline systems that are discriminatively trained. "40 MIX" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN used has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights that are less than a certain threshold to zero. To further improve the accuracy, the DNN-HMM and the GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using the DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
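The sparsification step mentioned above can be sketched as a simple thresholding of a weight matrix. This is only an illustration: the function name and toy weights are hypothetical, and comparing the absolute value of each weight against the threshold is an assumption, since the text says only that weights below a certain threshold were set to zero.<br />

```python
def sparsify(weights, threshold):
    """Zero out every weight whose magnitude is below the threshold.

    weights: list of rows (lists of floats)
    Assumption: the comparison uses the absolute value of each weight.
    """
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


# Hypothetical 2 x 2 weight matrix.
sparse_weights = sparsify([[0.5, -0.01], [0.02, -0.7]], threshold=0.05)
```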
== YouTube Speech Recognition Task ==<br />
The goal of this task is to transcribe YouTube data. This type of data does not have a strong language model to constrain the interpretation of the speech information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature-space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only 4 hidden layers were used, to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. Also, the DNN-HMM and GMM-HMM models were combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced it by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870)<br />
|-<br />
| YOUTUBE<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
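The text reports some improvements as absolute and others as relative reductions. The following small sketch, with a hypothetical helper name, shows how the two are related, using numbers from the table above. Note that the 16.0 baseline for Google Voice Input comes from a GMM-HMM trained on more data than the DNN.<br />

```python
def relative_reduction(baseline, new):
    """Relative error-rate reduction in percent, given two error rates in percent."""
    return 100.0 * (baseline - new) / baseline


# Switchboard (test set 1): 27.4 -> 18.5 is 8.9 absolute.
switchboard_rel = round(relative_reduction(27.4, 18.5), 1)
# Google Voice Input: 16.0 -> 12.3, about the 23% relative quoted in the text.
google_rel = round(relative_reduction(16.0, 12.3), 1)
```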
<br />
= Alternative Pretraining Methods for DNNs =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=25833deep neural networks for acoustic modeling in speech recognition2015-11-04T21:51:47Z<p>Arashwan: /* Summary for the Main Results for DNN Acoustic Models on Large Data Sets */</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model temporal variation and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits an acoustic observation. The speech signal in the classical systems is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw speech signal. Although GMMs are quite flexible and fairly easy to train using the Expectation Maximization (EM) algorithm, they are inefficient at modelling data that lie close to a nonlinear manifold, which is the case for speech data. Deep Neural Networks (DNNs) do not suffer from this shortcoming, hence they can learn much better models than GMMs. Over the past few years, training DNNs has become possible thanks to advances in machine learning and computer hardware, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small- and large-vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
An HMM requires the likelihoods of the observations <math>p(AcousticInput|HMMstate)</math> for running the forward-backward algorithm or for computing a Viterbi alignment. DNNs output the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to a scaled version of the likelihood by dividing them by <math>p(HMMstate)</math>, where <math>p(HMMstate)</math> is the frequency of each HMM state in the training data. The conversion from posteriors to likelihoods is important when the training labels are highly unbalanced.<br />
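The conversion follows from Bayes' rule: <math>p(AcousticInput|HMMstate) \propto p(HMMstate|AcousticInput) / p(HMMstate)</math>, where the dropped factor <math>p(AcousticInput)</math> is the same for every state. A minimal sketch in the log domain is given below. The function name, the flooring constant, and the toy numbers are assumptions made for illustration.<br />

```python
import math


def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Return log p(state | input) - log p(state) for each HMM state.

    posteriors: DNN softmax outputs for one frame
    priors:     state frequencies counted on the training alignment
    floor:      hypothetical small constant that avoids log(0)
    """
    return [math.log(max(p, floor)) - math.log(max(q, floor))
            for p, q in zip(posteriors, priors)]


# Toy frame with unbalanced priors: state 0 is very frequent in training.
posteriors = [0.7, 0.2, 0.1]
priors = [0.8, 0.15, 0.05]
scores = scaled_log_likelihoods(posteriors, priors)
best_state = max(range(len(scores)), key=lambda k: scores[k])
```

In this toy case the posterior favours state 0, but after dividing by the priors the rare state 2 has the highest scaled likelihood, which illustrates why the conversion matters for unbalanced labels.<br />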
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. DNN-HMM systems outperformed the classical GMM-HMM systems on it. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>; they reported a significant improvement in accuracy over the state-of-the-art systems at that time. It was found that the structure of the DNN (i.e. the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus more on tuning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they provide uncorrelated features, which avoids the need for full-covariance GMMs. Some of the acoustic information is lost when MFCCs are used. DNNs, on the other hand, can work with correlated features, which opened the door to using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. It was found that using filter-bank features with DNNs improved the accuracy by 1.7% <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy or the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math> and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the conditional probability <math>p(l_{1:T}|v_{1:T})</math>; this is done for the softmax layer only, keeping the parameters of the hidden layers <math>h</math> fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
Where <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes the value one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with that transition feature, and <math>\lambda</math> are the weights of the softmax layer. The parameters <math>\gamma,\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
== DNNs and GMMs ==<br />
As the following table shows, monophone DNNs outperform the best triphone GMM-HMM system by 1.7% absolute (20.0% versus 21.7% PER). Several factors may explain this success:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, for which each parameter applies to only a small amount of the data.<br />
# A GMM assumes that each data point is generated by a single component, which makes it inefficient at modeling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and it also allows them to analyse a larger window of the signal at each time step.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phonetic error rate (PER) results on the TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application of recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs On FBank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of DNN-HMM systems on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large-vocabulary tasks and outperformed GMM-HMM systems on every task.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
The Bing-Voice-Search data set consists of 24 hrs of speech data with different sources of acoustic variation such as noise, music, side-speech, accents, sloppy pronunciations, interruptions, and mobile phone differences. The DNN-HMM system was based on the DNN that worked well for TIMIT. It contained 5 hidden layers of size 2048 each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69.6% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the data size to 48 hrs, achieving a sentence accuracy of 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech data, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN used contains 7 hidden layers of size 2048 each. A trigram language model was used; it was trained on a 2000 hr speech corpus.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref><br />
. The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training as the DNN system.<br />
{| class="wikitable"<br />
|+ Comparing five different DBN-DNN acoustic models with two strong GMM-HMM baseline systems that are discriminatively trained. "40 MIX" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN used has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights that are less than a certain threshold to zero. To further improve the accuracy, the DNN-HMM and the GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using the DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
== YouTube Speech Recognition Task ==<br />
The goal of this task is to transcribe YouTube data. This type of data does not have a strong language model to constrain the interpretation of the speech information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature-space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only 4 hidden layers were used, to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. Also, the DNN-HMM and GMM-HMM models were combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced it by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870)<br />
|-<br />
| YOUTUBE<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
<br />
= Alternative Pretraining Methods for DNNs =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<references /></div>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=25832deep neural networks for acoustic modeling in speech recognition2015-11-04T21:50:11Z<p>Arashwan: /* DNN for Large-Vocabulary Speech Recognition */</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model temporal variation and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits an acoustic observation. The speech signal in the classical systems is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw speech signal. Although GMMs are quite flexible and fairly easy to train using the Expectation Maximization (EM) algorithm, they are inefficient at modelling data that lie close to a nonlinear manifold, which is the case for speech data. Deep Neural Networks (DNNs) do not suffer from this shortcoming, hence they can learn much better models than GMMs. Over the past few years, training DNNs has become possible thanks to advances in machine learning and computer hardware, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small- and large-vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
An HMM requires the likelihoods of the observations <math>p(AcousticInput|HMMstate)</math> for running the forward-backward algorithm or for computing a Viterbi alignment. DNNs output the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to a scaled version of the likelihood by dividing them by <math>p(HMMstate)</math>, where <math>p(HMMstate)</math> is the frequency of each HMM state in the training data. The conversion from posteriors to likelihoods is important when the training labels are highly unbalanced.<br />
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. DNN-HMM systems outperformed the classical GMM-HMM systems on it. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>; they reported a significant improvement in accuracy over the state-of-the-art systems at that time. It was found that the structure of the DNN (i.e. the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus more on tuning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they provide uncorrelated features, which avoids the need for full-covariance GMMs. Some of the acoustic information is lost when MFCCs are used. DNNs, on the other hand, can work with correlated features, which opened the door to using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. It was found that using filter-bank features with DNNs improved the accuracy by 1.7% <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy or the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math> and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the conditional probability <math>p(l_{1:T}|v_{1:T})</math>; this is done for the softmax layer only, keeping the parameters of the hidden layers <math>h</math> fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
Where <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes the value one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with that transition feature, and <math>\lambda</math> are the weights of the softmax layer. The parameters <math>\gamma,\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
== DNNs and GMMs ==<br />
As the following table shows, monophone DNNs outperform the best triphone GMM-HMM system by 1.7% absolute. Several factors may explain this success:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, for which each parameter applies to only a small amount of the data.<br />
# A GMM assumes that each data point is generated by a single component, which makes it inefficient at modeling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and it also allows them to analyse a larger window of the signal at each time step.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phonetic error rate (PER) results on the TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application to recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs on FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs on FBank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of the DNN-HMM system on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large-vocabulary tasks, and they outperformed GMM-HMM systems on every task.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
The Bing-Voice-Search data set consists of 24 hrs of speech data with different sources of acoustic variation, such as noise, music, side-speech, accents, sloppy pronunciations, interruptions, and mobile phone differences. The DNN-HMM system was based on the DNN that worked well for TIMIT. It contained 5 hidden layers of size 2048 each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69.6% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the training data to 48 hrs, achieving a sentence accuracy of 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
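The 11-frame input window can be illustrated with a small frame-stacking helper. This is only a sketch under stated assumptions: the name <code>stack_context</code> is hypothetical, frames are plain lists of floats, and the edges of the utterance are padded by repeating the first or last frame, which is a common choice but not something the source specifies.<br />

```python
def stack_context(frames, left=5, right=5):
    """Concatenate each frame with `left` preceding and `right` following frames.

    With left = right = 5 every output vector covers 11 frames, and the DNN
    would be trained to predict the HMM state of the middle one.
    Edge frames are padded by repetition (an assumption, see the lead-in).
    """
    num_frames = len(frames)
    stacked = []
    for t in range(num_frames):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), num_frames - 1)  # clamp to valid range
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


if __name__ == "__main__":
    # Toy utterance: 20 frames of 2 features each (illustrative values only).
    frames = [[float(t), float(-t)] for t in range(20)]
    stacked = stack_context(frames)
    print(len(stacked), len(stacked[0]))
```

Each stacked vector has 11 times the per-frame dimensionality, so the input layer grows with the window size while the number of training examples stays the same.<br />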
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech data, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN used contains 7 hidden layers of size 2048 each. A trigram language model was used, trained using a 2000 hr speech corpus.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref><br />
. The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training as the DNN system.<br />
{| class="wikitable"<br />
|+ Comparing five different DBN-DNN acoustic models with two strong GMM-HMM baseline systems that are discriminatively trained. "40 Mix" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN used has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights that are less than a certain threshold to zero. To further improve the accuracy, the DNN-HMM and the GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using the DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
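The sparsification step can be sketched as simple thresholding. The source only says that weights below a threshold are set to zero; the sketch below assumes the comparison is on the weight magnitude, and the function names and the toy threshold are hypothetical.<br />

```python
def sparsify(weights, threshold):
    """Return a copy of a weight matrix with small-magnitude entries set to zero.

    Assumption: "less than a threshold" is read as |w| < threshold.
    """
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


if __name__ == "__main__":
    # Toy 2 x 2 weight matrix (illustrative values only).
    weights = [[0.5, -0.01], [0.02, -0.8]]
    pruned = sparsify(weights, threshold=0.1)
    print(pruned, sparsity(pruned))
```

In practice the network would normally be trained further after pruning with the zeroed weights held fixed, but that step is outside this sketch.<br />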
== YouTube Speech Recognition Task ==<br />
The goal of this task is to transcribe YouTube data. This type of data does not have a strong language model to constrain the interpretation of the speech information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision-tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature-space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only 4 hidden layers were used, to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. The DNN-HMM and GMM-HMM models were also combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced it by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
<br />
{| class="wikitable"<br />
|+ Comparison of error rates in % for DNN-HMM and GMM-HMM acoustic models on five large-vocabulary tasks. The DNN-HMM and GMM-HMM columns use the stated hours of training data; the last column gives the GMM-HMM result with a larger training set, where reported.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0<br />
|-<br />
| YOUTUBE<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
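Because the text mixes absolute and relative reductions, it can help to compute both from the table. The snippet below is a small worked example: the helper name <code>relative_reduction</code> is hypothetical, and the numbers are copied from rows of the summary table that report both systems on the same training data.<br />

```python
def relative_reduction(baseline, new):
    """Fraction of the baseline error rate removed by the new system."""
    return (baseline - new) / baseline


# (task, GMM-HMM error %, DNN-HMM error %) taken from the summary table.
rows = [
    ("Switchboard (test set 1)", 27.4, 18.5),
    ("Switchboard (test set 2)", 23.6, 16.1),
    ("English Broadcast News", 18.8, 17.5),
    ("Bing Voice Search", 36.2, 30.4),
]

if __name__ == "__main__":
    for task, gmm, dnn in rows:
        absolute = gmm - dnn
        relative = 100 * relative_reduction(gmm, dnn)
        print(f"{task}: {absolute:.1f} absolute, {relative:.1f}% relative")
```

For example, going from 27.4% to 18.5% on Switchboard is an 8.9 point absolute reduction, which is roughly a one-third relative reduction.<br />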
<br />
= Alternative Pretraining Methods for DNNs =<br />
<br />
= Conclusions and Discussions =<br />
<br />
= References =<br />
<references /></div>Arashwan