generating text with recurrent neural networks (statwiki, wiki.math.uwaterloo.ca, revision of 2015-12-07)
<hr />
<div>= Introduction =<br />
<br />
The goal of this paper is to introduce a new type of recurrent neural network for character-level language modelling that allows the input character at a given timestep to multiplicatively gate the connections that make up the hidden-to-hidden layer weight matrix. The paper also introduces a solution to the problem of vanishing and exploding gradients by applying a technique called Hessian-Free optimization to effectively train a recurrent network that, when unrolled in time, has approximately 500 layers. At the date of publication, this network was arguably the deepest neural network ever trained successfully. <br />
<br />
Strictly speaking, a language model is a probability distribution over sequences of words or characters, and such models are typically used to predict the next character or word in a sequence given some number of preceding characters or words. Recurrent neural networks are naturally applicable to this task, since they make predictions based on a current input and a hidden state whose value is determined by some number of previous inputs. Alternative methods that the authors compare their results to include a hierarchical Bayesian model called a 'sequence memoizer' <ref> Wood, F., C. Archambeau, J. Gasthaus, L. James, and Y.W. Teh. [http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/airg/readings/2012_02_28_a_stochastic_memoizer.pdf "A Stochastic Memoizer for Sequence Data"] ICML, (2009) </ref> and a mixture of context models referred to as PAQ <ref> Mahoney, M. [https://repository.lib.fit.edu/bitstream/handle/11141/154/cs-2005-16.pdf?sequence=1&isAllowed=y "Adaptive Weighing of Context Models for Lossless Data Compression"], Florida Institute of Technology Technical Report, (2005) </ref>, which actually includes word-level information (rather than strictly character-level information). The multiplicative RNN introduced in this paper improves on the state-of-the-art for solely character-level language modelling, but is somewhat worse than the state-of-the-art for text compression. <br />
<br />
To give a brief review, an ordinary recurrent neural network is parameterized by three weight matrices, <math>\ W_{hi} </math>, <math>\ W_{hh} </math>, and <math>\ W_{oh} </math>, and functions to map a sequence of <math> N </math> input states <math>\ [i_1, ... , i_N] </math> to a sequence of hidden states <math>\ [h_1, ... , h_N] </math> and a sequence of output states <math>\ [o_1, ... , o_N] </math>. The matrix <math>\ W_{hi} </math> parameterizes the mapping from the current input state to the current hidden state, while the matrix <math>\ W_{hh} </math> parameterizes the mapping from the previous hidden state to the current hidden state, such that the current hidden state is a function of the previous hidden state and the current input state. Finally, the matrix <math>\ W_{oh} </math> parameterizes the mapping from the current hidden state to the current output state. So, at a given timestep <math>\ t </math>, the values of the hidden state and output state are as follows:<br />
<br />
<br />
:<math>\ h_t = \tanh(W_{hi}i_t + W_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
:<math>\ o_t = W_{oh}h_t + b_o </math> <br />
<br />
<br />
where <math>\ b_o</math> and <math>\ b_h</math> are bias vectors. Typically, the output state is converted into a probability distribution over characters or words using the softmax function. The network can then be treated as a generative model of text by sampling from this distribution and providing the sampled output as the input to the network at the next timestep.<br />
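The update equations and the generative sampling loop above can be sketched in a few lines of NumPy. The sizes and random initialization here are illustrative only (the paper's vocabulary is 86 characters, but the hidden size and weights below are made up for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 86-character vocabulary, small hidden layer.
V, H = 86, 64
W_hi = rng.normal(0.0, 0.01, (H, V))   # input-to-hidden
W_hh = rng.normal(0.0, 0.01, (H, H))   # hidden-to-hidden
W_oh = rng.normal(0.0, 0.01, (V, H))   # hidden-to-output
b_h, b_o = np.zeros(H), np.zeros(V)

def rnn_step(i_t, h_prev):
    """h_t = tanh(W_hi i_t + W_hh h_{t-1} + b_h);  o_t = W_oh h_t + b_o."""
    h_t = np.tanh(W_hi @ i_t + W_hh @ h_prev + b_h)
    return h_t, W_oh @ h_t + b_o

def generate(n_chars, start_idx=0):
    """Generative mode: softmax the output state into a distribution over
    characters, sample one, and feed it back as the next one-hot input."""
    h, idx, out = np.zeros(H), start_idx, []
    for _ in range(n_chars):
        i_t = np.zeros(V)
        i_t[idx] = 1.0                 # one-hot input character
        h, o = rnn_step(i_t, h)
        p = np.exp(o - o.max())
        p /= p.sum()                   # softmax over characters
        idx = int(rng.choice(V, p=p))
        out.append(idx)
    return out

sampled = generate(5)                  # five sampled character indices
```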
<br />
Recurrent networks are known to be very difficult to train due to the existence of a highly unstable relationship between a network's parameters and the gradient of its cost function. Intuitively, the surface of the cost function is intermittently punctuated by abrupt changes (giving rise to exploding gradients) and nearly flat plateaus (giving rise to vanishing gradients) that can effectively become poor local minima when a network is trained through gradient descent. Techniques for improving training include the use of Long Short-Term Memory networks <ref> Hochreiter, Sepp, and Jürgen Schmidhuber. [http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf "Long short-term memory."] Neural computation 9.8 (1997): 1735-1780. </ref>, in which memory units are used to selectively preserve information from previous states, and the use of Echo State networks, <ref> Jaeger, H. and H. Haas. [http://www.sciencemag.org/content/304/5667/78.short "Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication."] Science, 304.5667 (2004): 78-80. </ref> which learn only the output weights on a network with recurrent connections that implement a wide range of time-varying patterns. In this paper, the method of Hessian-free optimization is used instead of these alternatives. <br />
<br />
[[File:RNN.png | frame | centre | A depiction of a recurrent neural network unrolled through three time steps.]]<br />
<br />
= Hessian-Free Optimization = <br />
<br />
While this optimization technique is described elsewhere in Martens (2010) <ref> Martens, J. [http://icml2010.haifa.il.ibm.com/papers/458.pdf "Deep learning via Hessian-free optimization."] ICML, (2010) </ref><br />
, its use is essential to obtaining the successful results reported in this paper. In brief, the technique uses information about the second derivatives of the cost function to perform more intelligent parameter updates. This information is helpful because in cases where the gradient is changing very slowly along a particular dimension, it is more efficient to take larger steps in the direction of descent along that dimension. Alternatively, if the gradient is changing very rapidly along a particular dimension, then it makes sense to take smaller steps to avoid 'bouncing' off of a steep incline in the cost function and moving to a less desirable location in parameter space. The relevant second-order information is computed using the method of finite differences to avoid explicitly computing the Hessian of the cost function. <br />
<br />
What is important about this technique is that it provides a solution to the problem of vanishing and exploding gradients during the training of recurrent neural networks. Vanishing gradients are accommodated by descending much more rapidly along the cost function in areas where it has relatively low curvature (e.g., when the cost function is nearly flat), while exploding gradients are accommodated by descending much more slowly in areas where it has relatively high curvature (e.g., near a steep cliff). The figure below illustrates how Hessian-free optimization improves the training of neural networks in general. <br />
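The key computational trick is that curvature information can be accessed through Hessian-vector products without ever forming the Hessian itself. The sketch below uses central finite differences of the gradient on a toy quadratic; this is only an illustration of the idea, not the paper's exact machinery (Martens' method works with Gauss-Newton curvature and conjugate gradients):

```python
import numpy as np

def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    """Approximate H v by finite differences of the gradient, so the full
    Hessian is never formed: H v ≈ (∇f(θ + εv) − ∇f(θ − εv)) / (2ε)."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

# Toy quadratic f(θ) = ½ θᵀ A θ, whose gradient is A θ and Hessian is A,
# so the finite-difference product should match A v.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad_fn = lambda theta: A @ theta

theta = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
hv = hessian_vector_product(grad_fn, theta, v)   # ≈ A @ v
```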
<br />
[[File:HFF.png | frame | centre | On the left is training with naive gradient descent, and on the right is training via the use of 2nd order information about the cost function.]]<br />
<br />
= Multiplicative Recurrent Neural Networks = <br />
<br />
The authors report that using a standard recurrent neural network trained via Hessian-free optimization produces only mediocre results. As such, they introduce a new architecture called a multiplicative recurrent neural network (MRNN). The motivating intuition behind this architecture is that the input at a given time step should both additively contribute to the hidden state (through the mapping performed by the input-to-hidden weights) and additionally determine the weights on the recurrent connections to the hidden state. In other words, the idea is to define a unique weight matrix <math>\ W_{hh} </math> for each possible input. The reason this design is hypothesized to improve the predictive adequacy of the model is due to the idea that the ''conjunction'' of the input at one time step and the hidden state at the previous time step is important. Capturing this conjunction requires the input to influence the contribution of the previous hidden state to the current hidden state. Otherwise, the previous hidden state and the current input will make entirely independent contributions to the calculation of the current hidden state. Formally, this changes the calculation of the hidden state at a given time step as follows:<br />
<br />
<br />
:<math>\ h_t = \tanh(W_{hi}i_t + W^{i_t}_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
where <math>\ W^{i_t}_{hh} </math> is an input-specific hidden-to-hidden weight matrix. As a first approach to implementing this MRNN, the authors suggest using a tensor of rank 3 to store the hidden-to-hidden weights. The idea is that the tensor stores one weight matrix per possible input; when the input is provided as a one-hot vector, tensor contraction (i.e. a generalization of matrix multiplication) can be used to extract the 'slice' of the tensor that contains the appropriate set of weights. One problem with this approach is that it quickly becomes impractical to store the hidden-to-hidden weights as a tensor when the hidden state has a large number of dimensions. For instance, if a network's hidden layer encodes a vector with 1000 dimensions, then the number of parameters in the tensor that need to be learned will be equal to <math>\ 1000^2 \times N </math>, where <math>\ N </math> is the vocabulary size. In short, this method will add many millions of parameters to a model for a non-trivially sized vocabulary. <br />
<br />
To fix this problem, the tensor is factored using a technique described in Taylor & Hinton (2009) <ref>Taylor, G. and G. Hinton. [http://www.cs.toronto.edu/~fritz/absps/fcrbm_icml.pdf "Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style"] ICML (2009) </ref>. The idea is to define three matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math> that approximate the use of a tensor in determining the value of <math>\ W^{i_t}_{hh} </math> as follows:<br />
<br />
<br />
:<math>\ W^{i_t}_{hh} = W_{hf} \cdot diag(W_{fi}i_t) \cdot W_{fh} </math><br />
<br />
<br />
Intuitively, this factorization produces two vectors from the current input state and the previous hidden state, takes their element-wise product, and applies a linear transformation to produce the input to the hidden layer at the current timestep. The triangle units in the figure below indicate where the element-wise product occurs, and the connections into and out of these units are parameterized by the matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math>. The element-wise multiplication is implemented by diagonalizing the matrix-vector product <math>\ W_{fi}i_t </math>, and if the dimensionality of this matrix-vector product (i.e. the dimensionality of the layer of multiplicative units) is allowed to be arbitrarily large, then this factorization is just as expressive as using a tensor to store the hidden-to-hidden weights. <br />
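The factored update is cheap to implement because the diagonal matrix never needs to be materialized: it reduces to an element-wise product of two factor-space vectors. A minimal NumPy sketch follows; the sizes and random weights are illustrative (the paper uses 86 characters, 1500 hidden units, and 1500 factors):

```python
import numpy as np

rng = np.random.default_rng(1)
V, H, F = 86, 64, 64                   # vocabulary, hidden units, factors
W_hi = rng.normal(0.0, 0.01, (H, V))   # input-to-hidden
W_fi = rng.normal(0.0, 0.01, (F, V))   # input  -> factors
W_fh = rng.normal(0.0, 0.01, (F, H))   # hidden -> factors
W_hf = rng.normal(0.0, 0.01, (H, F))   # factors -> hidden
b_h = np.zeros(H)

def mrnn_step(i_t, h_prev):
    """h_t = tanh(W_hi i_t + W_hf diag(W_fi i_t) W_fh h_{t-1} + b_h).
    diag(W_fi i_t) is never formed: the gating is just an element-wise
    product of two factor-space vectors (the 'triangle' units)."""
    f = (W_fi @ i_t) * (W_fh @ h_prev)   # input-gated factor activations
    return np.tanh(W_hi @ i_t + W_hf @ f + b_h)

i_t = np.zeros(V)
i_t[3] = 1.0                           # one-hot input character
h = mrnn_step(i_t, np.zeros(H))
```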
<br />
[[File:MRNN.png | frame | centre | A depiction of a multiplicative recurrent neural network unrolled through three time steps.]]<br />
<br />
In the experiments described below, an MRNN is trained via Hessian-free optimization on sequences of 250 characters. The first 50 characters are used only to condition the hidden state, so 200 predictions are generated per sequence. 1500 hidden units were used, along with 1500 factors (i.e. multiplicative gates, or the triangles in the figure above), yielding an unrolled network of approximately 500 layers if the multiplicative units are treated as forming a layer. Training was performed with a parallelized system consisting of 8 GPUs. A vocabulary of 86 characters was used in all cases. <br />
<br />
= Quantitative Experiments =<br />
<br />
To compare the performance of the MRNN to that of the sequence memoizer and PAQ, three 100 MB datasets were used: a selection of Wikipedia articles, a selection of New York Times articles, and a corpus of all available articles published in NIPS and JMLR. The last 10 million characters in each dataset were held out for testing. Additionally, the MRNN was trained on the larger corpora from which the Wikipedia text and NYT articles were drawn (i.e. all of Wikipedia, and the entire set of NYT articles). <br />
<br />
The models were evaluated by calculating the number of bits per character each model achieves on the three test sets. This metric is essentially a measure of model perplexity, which reflects how well a given model predicts the data it is being tested on. If the number of bits per character is high, the model is, on average, highly uncertain about the value of each character in the test set; if it is low, the model is less uncertain. One way to think about this quantity is as the average amount of additional information (in bits) needed by the model to exactly identify the value of each character in the test set. So, a lower measure is better, indicating that the model achieves a good representation of the underlying data (it is sometimes helpful to think of a language model as a compressed representation of a text corpus). <br />
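Concretely, bits per character is the mean negative log2 probability the model assigned to each character that actually occurred, as in this small sketch (the probability values are made up for illustration):

```python
import numpy as np

def bits_per_character(probs):
    """Average number of bits the model needs per character of the test
    set: the mean of −log2 p(c_t | context), where each entry of `probs`
    is the probability the model assigned to the observed character."""
    return float(-np.mean(np.log2(np.asarray(probs, dtype=float))))

# A model that always gives the true next character probability 1/2
# needs exactly 1 extra bit per character; certainty (p = 1) needs 0.
bpc_uncertain = bits_per_character([0.5, 0.5, 0.5, 0.5])   # → 1.0
bpc_certain = bits_per_character([1.0, 1.0])               # → 0.0
```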
<br />
As illustrated in the table below, the MRNN achieves a lower number of bits per character than the hierarchical Bayesian model, but a higher number than the PAQ model (which, recall, is not a strictly character-level model). The numbers in brackets indicate the bits per character achieved on the training data, and the column labelled 'Full Set' reports the results of training the MRNN on the full Wikipedia and NYT corpora. <br />
<br />
[[File:bits.png | frame | centre | Bits per character achieved by each model on each dataset.]]<br />
<br />
These results indicate that the MRNN beat the existing state-of-the-art for pure character-level language modelling at the time of publication. <br />
<br />
= Qualitative Experiments =<br />
<br />
By examining the output of the MRNN, it is possible to see what kinds of linguistic patterns it is able to learn. Most striking is the fact that the model consistently produces correct words from a fairly sophisticated vocabulary. The model is also able to balance parentheses and quotation marks over many time steps, and it occasionally produces plausible non-words such as 'cryptoliation' and 'homosomalist'. The text in the figure below was produced by a model trained on Wikipedia: the model was run in generative mode fewer than 10 times using the phrase 'The meaning of life is' as an initial input, and the most interesting output sequence was selected. The character '?' indicates an unknown item, and some of the spacing and punctuation oddities are due to preprocessing and are apparently common in the dataset. <br />
<br />
[[File:text.png | frame | centre | A selection of text generated by an MRNN initialized with the sequence "The meaning of life is...".]]<br />
<br />
Another interesting qualitative demonstration of the model's abilities involves initializing the model with a more complicated sequence and seeing what sort of continuations it produces. In the figure below, a number of sampled continuations of the phrase 'England, Spain, France, Germany' are shown. Generally, the model is able to provide continuations that preserve the list-like structure of the phrase. Moreover, the model is also able to recognize that the list is a list of locations, and typically offers additional locations as its predicted continuation of the sequence. <br />
<br />
[[File:locations.png | frame | centre | Selections of text generated by an MRNN initialized with the sequence "England, Spain, France, Germany".]]<br />
<br />
What is particularly impressive about these results is the fact that the model is learning a distribution over sequences of characters only. From this distribution, a broad range of syntactic and lexical knowledge emerges. It is also worth noting that it is much more efficient to train a model with a small character-level vocabulary than it is to train a model with a word-level vocabulary (which can have tens of thousands of items). As such, the character-level MRNN is able to scale to large datasets quite well.<br />
<br />
Moreover, the authors find that the MRNN remains sensitive to cues such as an opening bracket even when the exact surrounding string never occurs in the training set. They argue that any method based on precise context matches is fundamentally incapable of utilizing long contexts, because the probability that a long context occurs more than once is very small.<br />
<br />
= Discussion =<br />
<br />
One aspect of this work that is worth considering concerns the degree to which input-dependent gating of the information passed from hidden state to hidden state actually improves results over and above a standard recurrent neural network. Presumably, Hessian-free optimization allows one to successfully train such a standard network, so a direct comparison between its results and those of the MRNN would be helpful. Otherwise, it is hard to discern the relative importance of the optimization technique and the network architecture in achieving the good language modelling results reported in this paper.<br />
<br />
= Bibliography = <br />
<references /></div>

learning Convolutional Feature Hierarchies for Visual Recognition (revision of 2015-12-06)
<hr />
<div>=Overview=<br />
<br />
This paper<ref>Kavukcuoglu, K, Sermanet, P, Boureau, Y, Gregor, K, Mathieu, M, and LeCun, Y. Learning convolutional feature hierarchies for visual recognition. In Advances in Neural Information Processing Systems, 1090-1098, 2010.</ref> describes methods for learning features extracted through convolutional filter banks. In particular, it gives methods for using sparse coding convolutionally. In sparse coding, the sparse feature vector z is constructed to reconstruct the input x with a dictionary D. The procedure produces a code z* by minimizing the energy function:<br />
<br />
:<math>L(x,z,D) = \frac{1}{2}||x-Dz||_2^2 + |z|_1, \ \ \ z^* = \underset{z}{\operatorname{arg\ min}} \ L(x,z,D)</math><br />
<br />
D is obtained by minimizing the above with respect to D: <math>\underset{z,D}{\operatorname{arg\ min}} \ L(x,z,D)</math>, averaged over the training set. The drawbacks to this method are that the representation is redundant and that inference for a whole image is computationally expensive. The reason is that in most applications of sparse coding to image analysis the system is trained on single image patches, which produces a dictionary of filters that are essentially shifted versions of each other over the patch and that reconstruct each patch in isolation.<br />
<br />
This first problem can be addressed by applying sparse coding to the entire image and treating the dictionary as a convolutional filter bank.<br />
<br />
:<math>L(x,z,D) = \frac{1}{2}||x - \sum_{k=1}^K D_k * z_k ||_2^2 + |z|_1</math><br />
<br />
Where D<sub>k</sub> is an s×s filter kernel, x is a w×h image, z<sub>k</sub> is a feature map of dimension (w+s-1)×(h+s-1), and * denotes the discrete convolution operator.<br />
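Given those dimensions, an s×s kernel convolved ('valid' mode) with a (w+s−1)×(h+s−1) feature map yields exactly a w×h reconstruction. A NumPy sketch of this energy follows; the convolution helper and toy sizes are illustrative, not from the paper:

```python
import numpy as np

def conv2d_valid(z_k, D_k):
    """'Valid' 2-D convolution: slide the flipped s×s kernel over the
    (w+s−1)×(h+s−1) feature map, producing a w×h output."""
    s = D_k.shape[0]
    Df = D_k[::-1, ::-1]                 # flip kernel for true convolution
    w, h = z_k.shape[0] - s + 1, z_k.shape[1] - s + 1
    out = np.empty((w, h))
    for i in range(w):
        for j in range(h):
            out[i, j] = np.sum(z_k[i:i + s, j:j + s] * Df)
    return out

def conv_sparse_energy(x, D, z):
    """L(x, z, D) = ½‖x − Σ_k D_k * z_k‖² + ‖z‖₁."""
    recon = sum(conv2d_valid(z_k, D_k) for D_k, z_k in zip(D, z))
    return 0.5 * np.sum((x - recon) ** 2) + np.sum(np.abs(z))

# Toy sizes: a 4×4 'image', K = 2 kernels of size 3×3, feature maps 6×6.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))
D = rng.normal(size=(2, 3, 3))
z = np.zeros((2, 6, 6))
energy = conv_sparse_energy(x, D, z)     # with z = 0 this is just ½‖x‖²
```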
<br />
The second problem can be addressed by using a trainable feed-forward encoder to approximate the sparse code:<br />
<br />
:<math>L(x,z,D,W) = \frac{1}{2}||x - \sum_{k=1}^K D_k * z_k ||_2^2 + \sum_{k=1}^K||z_k - f(W^k*x)||_2^2 + |z|_1, \ \ \ z^* = \underset{z}{\operatorname{arg\ min}} \ L(x,z,D,W) </math><br />
<br />
Where W<sup>k</sup> is an encoding convolutional kernel of size s×s, and f is a point-wise non-linear function. Both the form of f and the method to find z* are discussed below.<br />
<br />
The contribution of this paper is to address these two issues simultaneously, thus allowing convolutional approaches to sparse coding.<br />
<br />
=Method=<br />
<br />
The authors extend the coordinate descent sparse coding algorithm detailed in <ref>Li, Y and Osher, S. Coordinate Descent Optimization for l1 Minimization with Application to Compressed Sensing; a Greedy Algorithm. CAM Report, pages 09–17.</ref> to use convolutional methods.<br />
<br />
Two considerations for learning convolution dictionaries are:<br />
#Boundary effects due to convolution must be handled.<br />
#Derivatives should be calculated efficiently.<br />
<br />
----<br />
'''function ConvCoD'''<math>\, (x,D,\alpha)</math><br />
<br />
:'''Set:''' <math>\, S = D^T*D</math><br />
<br />
:'''Initalize:''' <math>\, z = 0;\ \beta = D^T * mask(x)</math><br />
<br />
:'''Require:''' <math>\, h_\alpha</math>: smooth thresholding function<br />
<br />
:'''repeat'''<br />
<br />
::<math>\, \bar{z} = h_\alpha(\beta)</math><br />
<br />
::<math>\, (k,p,q) = \underset{i,m,n}{\operatorname{arg\ max}} |z_{imn}-\bar{z}_{imn}|</math> (k: dictionary index, (p,q): location index)<br />
<br />
::<math>\, bi = \beta_{kpq}</math><br />
<br />
::<math>\, \beta = \beta + (z_{kpq} - \bar{z}_{kpq}) \times align(S(:,k,:,:),(p,q))</math> **<br />
<br />
::<math>\, z_{kpq} = \bar{z}_{kpq},\ \beta_{kpq} = bi</math><br />
<br />
:'''until''' change in <math>z</math> is below a threshold<br />
<br />
:'''end function'''<br />
----<br />
<nowiki>**</nowiki> MATLAB notation is used for slicing the tensor.<br />
<br />
In the above, <math>\beta = D^T * mask(x)</math> is used to handle boundary effects, where mask operates term by term and either zeros out or scales down the boundaries.<br />
<br />
The learning procedure is then stochastic gradient descent over the dictionary D, where the columns of D are normalized after each iteration.<br />
<br />
:<math>\forall x^i \in X</math> training set: <math>z^* = \underset{z}{\operatorname{arg\ min}}\ L(x^i,z,D), \quad D \leftarrow D - \eta \frac{\partial L(x^i,z^*,D)}{\partial D}</math><br />
<br />
Two encoder architectures are tested. The first is steepest descent sparse coding with a tanh encoding function, <math>g^k \times \tanh(x*W^k)</math>, which does not include a shrinkage operator and is therefore very limited in its ability to produce sparse representations.<br />
<br />
The second is convolutional CoD sparse coding with a smooth shrinkage operator as defined below. <br />
<br />
:<math>\tilde{z}=sh_{\beta^k,b^k}(x*W^k)</math> where <math>k = 1 \ldots K</math>.<br />
<br />
:<math>sh_{\beta^k,b^k}(s) = sign(s) \times \frac{1}{\beta^k} \log\left(\exp(\beta^k \times b^k) + \exp(\beta^k \times |s|) - 1\right) - b^k</math><br />
<br />
where <math>\beta</math> controls the smoothness of the kink of the shrinkage operator and b controls the location of the kink. The second system is more efficient in training, but the performance of the two systems is almost identical.<br />
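A minimal NumPy sketch of this smooth shrinkage nonlinearity (using the full form with the exp(βb) term, so that sh(0) = 0; the parameter values below are made up for illustration):

```python
import numpy as np

def smooth_shrink(s, beta=5.0, b=0.5):
    """sh(s) = sign(s)·(1/β)·log(exp(β·b) + exp(β·|s|) − 1) − b.
    Exactly 0 at s = 0, approaches |s| − b for large |s|; β controls how
    sharp the kink near ±b is (large β approaches hard soft-thresholding),
    and b controls where the kink sits."""
    s = np.asarray(s, dtype=float)
    inner = np.exp(beta * b) + np.exp(beta * np.abs(s)) - 1.0
    return np.sign(s) * (np.log(inner) / beta - b)

vals = smooth_shrink(np.array([0.0, 0.1, 2.0]))
# vals[0] is exactly 0; vals[1] is heavily shrunk toward 0;
# vals[2] is approximately 2.0 − 0.5 = 1.5.
```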
<br />
The convolutional encoder can also be used in multi-stage object recognition architectures. For each stage, the encoder is followed by absolute value rectification, contrast normalization and average subsampling.<br />
<br />
=Experiments=<br />
<br />
Two systems are used:<br />
#Steepest descent sparse coding with tanh encoder: <math>SD^{tanh}</math><br />
#Coordinate descent sparse coding with shrink encoder: <math>CD^{shrink}</math><br />
<br />
==Object Recognition using Caltech-101 Dataset==<br />
<br />
In the Caltech-101 dataset, each image contains a single object. Each image is processed by converting to grayscale and resizing, followed by contrast normalization. All results use 30 training samples per class and 5 different choices of the training set.<br />
<br />
''Architecture:'' 64 features are extracted by the first layer, followed by a second layer that produces 256 features. Second layer features are connected to first layer features by a sparse connection table.<br />
<br />
''First Layer:'' Both systems are trained using 64 dictionary elements, where each dictionary item is a 9×9 convolution kernel. Both systems are trained for 10 sparsity values from 0.1-3.0.<br />
<br />
''Second Layer:'' In the second layer, each of the 256 feature maps is connected to 16 randomly selected input features from the first layer.<br />
<br />
''One Stage System:'' In these results, the input is passed to the first layer, followed by absolute value rectification, contrast normalization, and average pooling. The output of the first layer is fed either to a logistic regression classifier or to the PMK-SVM classifier used in <ref>Lazebnik, S, Schmid, C, and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR’06, 2:2169–2178, 2006.</ref>.<br />
<br />
''Two Stage System:'' These results use both layers, followed by absolute value rectification, contrast normalization, and average pooling. Finally, a multinomial logistic regression classifier is used.<br />
<br />
[[File:CoD_results.png]]<br />
<br />
In the above, U represents one stage, UU represents two stages, and '+' indicates that supervised fine-tuning is performed afterwards.<br />
<br />
==Pedestrian Detection==<br />
<br />
The architecture is trained and evaluated on the INRIA Pedestrian dataset <ref>Dalal, N and Triggs, B. Histograms of oriented gradients for human detection. In Schmid, C, Soatto, S, and Tomasi, C, editors, CVPR’05, volume 2, pages 886–893, June 2005.</ref> which contains 2416 positive examples (after mirroring) and 1218 negative full images. For training, the dataset is augmented with minor translations and scaling, giving a total of 11370 examples for training and 1000 images for classification. The negative examples are augmented with larger scale variations to avoid false positives, giving a total of 9001 samples for training and 1000 for validation.<br />
<br />
The architecture for the pedestrian detection task is similar to that described in the previous section. It was trained both with and without unsupervised initialization, followed by supervised training. After one pass of training, the negative set was augmented with the 10 most offending samples from each full negative image.<br />
<br />
[[File:CoD_pedestrian_results.png]]<br />
<br />
=Discussion=<br />
*The paper presented an efficient method for convolutional training of feature extractors.<br />
*The resulting features look intuitively better than those obtained through non-convolutional methods, but classification results are only slightly better (where they're better at all) than existing methods.<br />
*It's not clear what effects in the pedestrian experiment are due to the method of preprocessing and variations on the dataset (scaling and translation) and which are due to the architecture itself. Comparisons are with other systems that processed input differently.<br />
<br />
=References=<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Convolutional_Feature_Hierarchies_for_Visual_Recognition&diff=27102learning Convolutional Feature Hierarchies for Visual Recognition2015-12-06T22:17:10Z<p>X435liu: /* Method */</p>
<hr />
<div>=Overview=<br />
<br />
This paper<ref>Kavukcuoglu, K, Sermanet, P, Boureau, Y, Gregor, K, Mathieu, M, and Cun, Y. . Learning convolutional feature hierarchies for visual recognition. In Advances in neural information processing systems, 1090-1098, 2010.</ref> describes methods for learning features extracted through convolutional feature banks. In particular, it gives methods for using sparse coding convolutionally. In sparse coding, the sparse feature vector z is constructed to reconstruct the input x with a dictionary D. The procedure produces a code z* by minimizing the energy function:<br />
<br />
:<math>L(x,z,D) = \frac{1}{2}||x-Dz||_2^2 + |z|_1, \ \ \ z^* = \underset{z}{\operatorname{arg\ min}} \ L(x,z,D)</math><br />
<br />
D is obtained by minimizing the above with respect to D: <math>\underset{z,D}{\operatorname{arg\ min}} \ L(x,z,D)</math>, averaged over the training set. The drawbacks to this method are the the representation is redundant and that the inference for a whole image is computationally expensive. The reason is that the system is trained on single image patches in most applications of sparse coding to image analysis, which produces a dictionary of filters that are essentially shifted versions of each other over the patch and reconstruct in isolation.<br />
<br />
This first problem can be addressed by applying sparse coding to the entire image and treating the dictionary as a convolutional filter bank.<br />
<br />
:<math>L(x,z,D) = \frac{1}{2}||x - \sum_{k=1}^K D_k * z_k ||_2^2 + |z|_1</math><br />
<br />
Where D<sub>k</sub> is an s×s filter kernel, x is a w×h image, z<sub>k</sub> is a feature map of dimension (w+s-1)×(h+s-1), and * denotes the discrete convolution operator.<br />
<br />
The second problem can be addressed by using a trainable feed-forward encoder to approximate the sparse code:<br />
<br />
:<math>L(z,z,D,W) = \frac{1}{2}||x - \sum_{k=1}^K D_k * z_k ||_2^2 + \sum_{k=1}^K||z_k - f(W^k*x)||_2^2 + |z|_1, \ \ \ z^* = \underset{z}{\operatorname{arg\ max}} \ L(x,z,D,W) </math><br />
<br />
Where W<sup>k</sup> is an encoding convolutional kernel of size s×s, and f is a point-wise non-linear function. Both the form of f and the method to find z* are discussed below.<br />
<br />
The contribution of this paper is to address these two issues simultaneously, thus allowing convolutional approaches to sparse coding.<br />
<br />
=Method=<br />
<br />
The authors extend the coordinate descent sparse coding algorithm detailed in <ref>Li, Y and Osher, S. Coordinate Descent Optimization for l1 Minimization with Application to Compressed Sensing; a Greedy Algorithm. CAM Report, pages 09–17.</ref> to use convolutional methods.<br />
<br />
Two considerations for learning convolution dictionaries are:<br />
#Boundary effects due to convolution must be handled.<br />
#Derivatives should be calculated efficiently.<br />
<br />
----<br />
'''function ConvCoD'''<math>\, (x,D,\alpha)</math><br />
<br />
:'''Set:''' <math>\, S = D^T*D</math><br />
<br />
:'''Initalize:''' <math>\, z = 0;\ \beta = D^T * mask(x)</math><br />
<br />
:'''Require:''' <math>\, h_\alpha:</math>: smooth thresholding function<br />
<br />
:'''repeat'''<br />
<br />
::<math>\, \bar{x} = h_\alpha(\beta)</math><br />
<br />
::<math>\, (k,p,q) = \underset{i,m,n}{\operatorname{arg\ max}} |z_{imn}-\bar{z_{imn}}|</math> (k: dictionary index, (p,q) location index)<br />
<br />
::<math>\, bi = \beta_{kpq}</math><br />
<br />
::<math>\, \beta = \beta + (z_kpg - \bar{z_{kpg}}) \times align(S(:,k,:,:),(p,q))</math> **<br />
<br />
::<math>\, z_{kpg} = \bar{z_{kpg}},\ \beta_{kpg} = bi</math><br />
<br />
:'''until''' change in <math>z</math> is below a threshold<br />
<br />
:'''end function'''<br />
----<br />
<nowiki>**</nowiki> MATLAB notation is used for slicing the tensor.<br />
<br />
In the above, <math>\beta = D^T * mask(x)</math> is use to handle boundary effects, where mask operates term by term and either puts zeros or scales down the boundaries.<br />
<br />
The learning procedure is then stochastic gradient descent over the dictionary D, where the columns of D are normalized after each iteration.<br />
<br />
:<math>\forall x^i \in X</math> training set: <math>z* = \underset{z}{\operatorname{arg\ max}}\ L(x^i,z,d), D \leftarrow D - \eta \frac {\partial(L,x^i,z^*,D}{\partial D}</math><br />
<br />
Two encoder architectures are tested. The first is steepest descent sparse coding with tanh encoding function using <math>g^k \times tanh(x*W^k)</math>, which does not include a shrinkage operator. Thus the ability to produce sparse representations is very limited.<br />
<br />
The second is convolutional CoD sparse coding with a smooth shrinkage operator as defined below. <br />
<br />
:<math>\tilde{z}=sh_{\beta^k,b^k}(x*W^k)</math> where k = 1..K.<br />
<br />
:<math>sh_{\beta^k,b^k}(s) = sign(s) \times 1/\beta^k \log(\exp(\beta^k \times |s|) - 1) - b^k</math><br />
<br />
where <math>\beta</math> controls the smoothness of the kink of the shrinkage operator and b controls its location.<br />
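A small numerical sketch of this operator follows. The handling of <math>s=0</math> (where the formula as written is singular) is an assumption, and the log term is rewritten in a numerically stable form to avoid overflow for large <math>\beta|s|</math>.

```python
import math

def smooth_shrink(s, beta=5.0, b=0.5):
    # sh_{beta,b}(s) = sign(s) * (1/beta) * log(exp(beta*|s|) - 1) - b,
    # rewritten as |s| + (1/beta)*log(1 - exp(-beta*|s|)) for stability
    if s == 0.0:
        return -b  # assumption: take the sign(s)*(...) term as 0 at s = 0
    sign = 1.0 if s > 0 else -1.0
    val = abs(s) + math.log1p(-math.exp(-beta * abs(s))) / beta
    return sign * val - b

# For large beta the operator approaches sign(s)*|s| - b: beta sets the
# smoothness of the kink and b its location, matching the text above.
```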
<br />
=Experiments=<br />
<br />
Two systems are used:<br />
#Steepest descent sparse coding with tanh encoder: <math>SD^{tanh}</math><br />
#Coordinate descent sparse coding with shrink encoder: <math>CD^{shrink}</math><br />
<br />
==Object Recognition using Caltech-101 Dataset==<br />
<br />
In the Caltech-101 dataset, each image contains a single object. Each image is processed by converting to grayscale and resizing, followed by contrast normalization. All results use 30 training samples per class and 5 different choices of the training set.<br />
<br />
''Architecture:'' 64 features are extracted by the first layer, followed by a second layer that produces 256 features. Second layer features are connected to first layer features by a sparse connection table.<br />
<br />
''First Layer:'' Both systems are trained using 64 dictionary elements, where each dictionary item is a 9×9 convolution kernel. Both systems are trained with 10 sparsity values ranging from 0.1 to 3.0.<br />
<br />
''Second Layer:'' In the second layer, each of the 256 feature maps is connected to 16 randomly selected input features from the first layer.<br />
<br />
''One Stage System:'' In these results, the input is passed through the first layer, followed by absolute value rectification, contrast normalization, and average pooling. The output of the first stage is then fed either to a logistic regression classifier or to the PMK-SVM classifier used in <ref>Lazebnik, S, Schmid, C, and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR’06, 2:2169–2178, 2006.</ref>.<br />
<br />
''Two Stage System:'' These results use both layers, followed by absolute value rectification, contrast normalization, and average pooling. Finally, a multinomial logistic regression classifier is used.<br />
<br />
[[File:CoD_results.png]]<br />
<br />
In the above, U denotes one stage, UU denotes two stages, and '+' indicates that supervised training is performed afterwards.<br />
<br />
==Pedestrian Detection==<br />
<br />
The architecture is trained and evaluated on the INRIA Pedestrian dataset <ref>Dalal, N and Triggs, B. Histograms of oriented gradients for human detection. In Schmid, C, Soatto, S, and Tomasi, C, editors, CVPR’05, volume 2, pages 886–893, June 2005.</ref>, which contains 2416 positive examples (after mirroring) and 1218 negative full images. For training, the positive set is augmented with small translations and scalings, giving a total of 11370 examples for training and 1000 for validation. The negative examples are augmented with larger scale variations to avoid false positives, giving a total of 9001 samples for training and 1000 for validation.<br />
<br />
The architecture for the pedestrian detection task is similar to that described in the previous section. It was trained both with and without unsupervised initialization, followed in each case by supervised training. After one pass of training, the negative set was augmented with the 10 most offending samples from each full negative image.<br />
<br />
[[File:CoD_pedestrian_results.png]]<br />
<br />
=Discussion=<br />
*The paper presented an efficient method for convolutional training of feature extractors.<br />
*The resulting features look intuitively better than those obtained through non-convolutional methods, but classification results are only slightly better (where they're better at all) than existing methods.<br />
*It's not clear which effects in the pedestrian experiment are due to the preprocessing and dataset variations (scaling and translation) and which are due to the architecture itself, since the comparisons are with other systems that processed their input differently.<br />
<br />
=References=<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Convolutional_Feature_Hierarchies_for_Visual_Recognition&diff=27101learning Convolutional Feature Hierarchies for Visual Recognition2015-12-06T21:58:12Z<p>X435liu: /* Overview */</p>
<hr />
<div>=Overview=<br />
<br />
This paper<ref>Kavukcuoglu, K, Sermanet, P, Boureau, Y, Gregor, K, Mathieu, M, and Cun, Y. . Learning convolutional feature hierarchies for visual recognition. In Advances in neural information processing systems, 1090-1098, 2010.</ref> describes methods for learning features extracted through convolutional feature banks. In particular, it gives methods for using sparse coding convolutionally. In sparse coding, the sparse feature vector z is constructed to reconstruct the input x with a dictionary D. The procedure produces a code z* by minimizing the energy function:<br />
<br />
:<math>L(x,z,D) = \frac{1}{2}||x-Dz||_2^2 + |z|_1, \ \ \ z^* = \underset{z}{\operatorname{arg\ min}} \ L(x,z,D)</math><br />
<br />
D is obtained by minimizing the above with respect to D: <math>\underset{z,D}{\operatorname{arg\ min}} \ L(x,z,D)</math>, averaged over the training set. The drawbacks of this method are that the representation is redundant and that inference for a whole image is computationally expensive. The reason is that most applications of sparse coding to image analysis train the system on single image patches, which produces a dictionary of filters that are essentially shifted versions of one another over the patch and that reconstruct each patch in isolation.<br />
<br />
The first problem can be addressed by applying sparse coding to the entire image and treating the dictionary as a convolutional filter bank.<br />
<br />
:<math>L(x,z,D) = \frac{1}{2}||x - \sum_{k=1}^K D_k * z_k ||_2^2 + |z|_1</math><br />
<br />
Where D<sub>k</sub> is an s×s filter kernel, x is a w×h image, z<sub>k</sub> is a feature map of dimension (w+s-1)×(h+s-1), and * denotes the discrete convolution operator.<br />
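To make the dimensions concrete, here is a minimal 1-D sketch of this energy (1-D rather than 2-D for brevity; the 2-D case is analogous). A feature map of length w+s-1 combined with a 'valid' convolution reconstructs a signal of length w, and the kernel flip distinguishes convolution from correlation. The sparsity weight lam is an added generalization; its default of 1 matches the formula above.

```python
def valid_conv(d, z):
    # 'valid' 1-D convolution of kernel d with map z: len(z)-len(d)+1 outputs
    dr = d[::-1]  # flip the kernel (correlation of dr with z == convolution)
    s = len(d)
    return [sum(dv * z[j + i] for i, dv in enumerate(dr))
            for j in range(len(z) - s + 1)]

def conv_energy(x, D, Z, lam=1.0):
    # x: signal of length w; D: K kernels of length s;
    # Z: K feature maps of length w + s - 1, so each D_k * z_k has length w
    recon = [0.0] * len(x)
    for d, z in zip(D, Z):
        for i, v in enumerate(valid_conv(d, z)):
            recon[i] += v
    residual = sum((xv - rv) ** 2 for xv, rv in zip(x, recon))
    sparsity = sum(abs(zv) for z in Z for zv in z)
    return 0.5 * residual + lam * sparsity
```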
<br />
The second problem can be addressed by using a trainable feed-forward encoder to approximate the sparse code:<br />
<br />
:<math>L(x,z,D,W) = \frac{1}{2}||x - \sum_{k=1}^K D_k * z_k ||_2^2 + \sum_{k=1}^K||z_k - f(W^k*x)||_2^2 + |z|_1, \ \ \ z^* = \underset{z}{\operatorname{arg\ min}} \ L(x,z,D,W) </math><br />
<br />
Where W<sup>k</sup> is an encoding convolutional kernel of size s×s, and f is a point-wise non-linear function. Both the form of f and the method to find z* are discussed below.<br />
<br />
The contribution of this paper is to address these two issues simultaneously, thus allowing convolutional approaches to sparse coding.<br />
<br />
=Method=<br />
<br />
The authors extend the coordinate descent sparse coding algorithm detailed in <ref>Li, Y and Osher, S. Coordinate Descent Optimization for l1 Minimization with Application to Compressed Sensing; a Greedy Algorithm. CAM Report, pages 09–17.</ref> to use convolutional methods.<br />
<br />
Two considerations for learning convolution dictionaries are:<br />
#Boundary effects due to convolution must be handled.<br />
#Derivatives should be calculated efficiently.<br />
<br />
----<br />
'''function ConvCoD'''<math>\, (x,D,\alpha)</math><br />
<br />
:'''Set:''' <math>\, S = D^T*D</math><br />
<br />
:'''Initialize:''' <math>\, z = 0;\ \beta = D^T * mask(x)</math><br />
<br />
:'''Require:''' <math>\, h_\alpha</math>: smooth thresholding function<br />
<br />
:'''repeat'''<br />
<br />
::<math>\, \bar{z} = h_\alpha(\beta)</math><br />
<br />
::<math>\, (k,p,q) = \underset{i,m,n}{\operatorname{arg\ max}} |z_{imn}-\bar{z}_{imn}|</math> (k: dictionary index, (p,q): location index)<br />
<br />
::<math>\, bi = \beta_{kpq}</math><br />
<br />
::<math>\, \beta = \beta + (z_{kpq} - \bar{z}_{kpq}) \times align(S(:,k,:,:),(p,q))</math> **<br />
<br />
::<math>\, z_{kpq} = \bar{z}_{kpq},\ \beta_{kpq} = bi</math><br />
<br />
:'''until''' change in <math>z</math> is below a threshold<br />
<br />
:'''end function'''<br />
----<br />
<nowiki>**</nowiki> MATLAB notation is used for slicing the tensor.<br />
<br />
In the above, <math>\beta = D^T * mask(x)</math> is used to handle boundary effects: mask operates term by term, either zeroing the boundary terms or scaling them down.<br />
<br />
The learning procedure is then stochastic gradient descent over the dictionary D, where the columns of D are normalized after each iteration.<br />
<br />
:<math>\forall x^i \in X</math> training set: <math>z^* = \underset{z}{\operatorname{arg\ min}}\ L(x^i,z,D), \ \ D \leftarrow D - \eta \frac {\partial L(x^i,z^*,D)}{\partial D}</math><br />
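The coordinate descent loop above can be sketched in plain (non-convolutional) Python. This is a minimal sketch under stated assumptions: unit-norm dictionary columns, plain soft thresholding standing in for the smooth operator <math>h_\alpha</math>, and none of the convolutional mask/align bookkeeping.

```python
def soft_threshold(v, alpha):
    # stand-in for h_alpha: shrink v toward zero by alpha
    if v > alpha:
        return v - alpha
    if v < -alpha:
        return v + alpha
    return 0.0

def cod(x, D, alpha, tol=1e-8, max_iter=1000):
    # D: list of unit-norm dictionary columns; x: input vector
    n = len(D)
    S = [[sum(a * b for a, b in zip(D[i], D[j])) for j in range(n)]
         for i in range(n)]                                      # S = D^T D
    beta = [sum(d * v for d, v in zip(D[i], x)) for i in range(n)]  # D^T x
    z = [0.0] * n
    for _ in range(max_iter):
        zbar = [soft_threshold(b, alpha) for b in beta]
        k = max(range(n), key=lambda i: abs(z[i] - zbar[i]))  # largest change
        if abs(z[k] - zbar[k]) < tol:
            break
        saved = beta[k]                     # save beta_k (the "bi" above)
        for j in range(n):                  # rank-one update of beta
            beta[j] += (z[k] - zbar[k]) * S[j][k]
        z[k], beta[k] = zbar[k], saved      # commit z_k, restore beta_k
    return [soft_threshold(b, alpha) for b in beta]
```

Only one coordinate of z changes per iteration, so the full matrix-vector product is replaced by a single column update of beta; this is what makes CoD cheap per step.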
<br />
Two encoder architectures are tested. The first is steepest descent sparse coding with a tanh encoding function, <math>g^k \times tanh(x*W^k)</math>.<br />
<br />
The second is convolutional CoD sparse coding with a smooth shrinkage operator as defined below. <br />
<br />
:<math>\tilde{z}=sh_{\beta^k,b^k}(x*W^k)</math> where k = 1..K.<br />
<br />
:<math>sh_{\beta^k,b^k}(s) = sign(s) \times 1/\beta^k \log(\exp(\beta^k \times |s|) - 1) - b^k</math><br />
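A small numerical sketch of this operator follows. The handling of <math>s=0</math> (where the formula as written is singular) is an assumption, and the log term is rewritten in a numerically stable form to avoid overflow for large <math>\beta|s|</math>.

```python
import math

def smooth_shrink(s, beta=5.0, b=0.5):
    # sh_{beta,b}(s) = sign(s) * (1/beta) * log(exp(beta*|s|) - 1) - b,
    # rewritten as |s| + (1/beta)*log(1 - exp(-beta*|s|)) for stability
    if s == 0.0:
        return -b  # assumption: take the sign(s)*(...) term as 0 at s = 0
    sign = 1.0 if s > 0 else -1.0
    val = abs(s) + math.log1p(-math.exp(-beta * abs(s))) / beta
    return sign * val - b

# For large beta the operator approaches sign(s)*|s| - b: beta sets the
# smoothness of the kink and b its location.
```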
<br />
=Experiments=<br />
<br />
Two systems are used:<br />
#Steepest descent sparse coding with tanh encoder: <math>SD^{tanh}</math><br />
#Coordinate descent sparse coding with shrink encoder: <math>CD^{shrink}</math><br />
<br />
==Object Recognition using Caltech-101 Dataset==<br />
<br />
In the Caltech-101 dataset, each image contains a single object. Each image is processed by converting to grayscale and resizing, followed by contrast normalization. All results use 30 training samples per class and 5 different choices of the training set.<br />
<br />
''Architecture:'' 64 features are extracted by the first layer, followed by a second layer that produces 256 features. Second layer features are connected to first layer features by a sparse connection table.<br />
<br />
''First Layer:'' Both systems are trained using 64 dictionary elements, where each dictionary item is a 9×9 convolution kernel. Both systems are trained with 10 sparsity values ranging from 0.1 to 3.0.<br />
<br />
''Second Layer:'' In the second layer, each of the 256 feature maps is connected to 16 randomly selected input features from the first layer.<br />
<br />
''One Stage System:'' In these results, the input is passed through the first layer, followed by absolute value rectification, contrast normalization, and average pooling. The output of the first stage is then fed either to a logistic regression classifier or to the PMK-SVM classifier used in <ref>Lazebnik, S, Schmid, C, and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR’06, 2:2169–2178, 2006.</ref>.<br />
<br />
''Two Stage System:'' These results use both layers, followed by absolute value rectification, contrast normalization, and average pooling. Finally, a multinomial logistic regression classifier is used.<br />
<br />
[[File:CoD_results.png]]<br />
<br />
In the above, U denotes one stage, UU denotes two stages, and '+' indicates that supervised training is performed afterwards.<br />
<br />
==Pedestrian Detection==<br />
<br />
The architecture is trained and evaluated on the INRIA Pedestrian dataset <ref>Dalal, N and Triggs, B. Histograms of oriented gradients for human detection. In Schmid, C, Soatto, S, and Tomasi, C, editors, CVPR’05, volume 2, pages 886–893, June 2005.</ref>, which contains 2416 positive examples (after mirroring) and 1218 negative full images. For training, the positive set is augmented with small translations and scalings, giving a total of 11370 examples for training and 1000 for validation. The negative examples are augmented with larger scale variations to avoid false positives, giving a total of 9001 samples for training and 1000 for validation.<br />
<br />
The architecture for the pedestrian detection task is similar to that described in the previous section. It was trained both with and without unsupervised initialization, followed in each case by supervised training. After one pass of training, the negative set was augmented with the 10 most offending samples from each full negative image.<br />
<br />
[[File:CoD_pedestrian_results.png]]<br />
<br />
=Discussion=<br />
*The paper presented an efficient method for convolutional training of feature extractors.<br />
*The resulting features look intuitively better than those obtained through non-convolutional methods, but classification results are only slightly better (where they're better at all) than existing methods.<br />
*It's not clear which effects in the pedestrian experiment are due to the preprocessing and dataset variations (scaling and translation) and which are due to the architecture itself, since the comparisons are with other systems that processed their input differently.<br />
<br />
=References=<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=27100deep Convolutional Neural Networks For LVCSR2015-12-06T00:49:33Z<p>X435liu: /* Results with Proposed Architecture */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They have outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems on both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal/spatial variations while reducing translation variance. CNNs are attractive for speech recognition for two reasons: first, they are translation invariant, which makes them an alternative to various speaker adaptation techniques; second, spectral representations of speech have strong local correlations, which CNNs can naturally capture.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, for which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to study the behaviour of CNNs for speech tasks, with results reported on the EARS dev04f dataset. 40-dimensional log mel-filter bank coefficients are used as features. The size of each hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set; training stops after the learning rate has been halved 5 times. <br />
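The learning-rate rule amounts to a "newbob"-style annealing schedule; here is a minimal sketch (the initial rate and the 0.001 improvement threshold are illustrative assumptions, not values from the paper):

```python
def lr_schedule(val_losses, lr0=0.01, min_impr=0.001, max_halvings=5):
    # val_losses: held-out objective after each training iteration;
    # returns the learning rate in effect at each iteration
    lr, halvings, rates = lr0, 0, []
    prev = float('inf')
    for loss in val_losses:
        rates.append(lr)
        if prev - loss < min_impr:   # insufficient improvement: halve the rate
            lr *= 0.5
            halvings += 1
            if halvings >= max_halvings:
                break                # stop training after the 5th halving
        prev = loss
    return rates
```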
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are typically used before the fully connected layers. The convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what had been explored before for speech recognition <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequency bands have different characteristics; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed weight sharing across nearby frequencies only. Although this addresses the problem, it limits the ability to stack multiple convolutional layers. In this work, weight sharing is done across the entire feature space, while using more filters than is typical in vision to capture the differences between the low and high frequency regions.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully-connected layers. The total number of network parameters is kept constant for a fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
A slight further improvement is obtained by using 128 hidden units in the first convolutional layer and 256 in the second: many hidden units are needed in the convolutional layers to capture the differences between frequency regions in speech.<br />
<br />
== Optimal Feature Set ==<br />
Note that Linear Discriminant Analysis (LDA) features cannot be used with CNNs because LDA removes local correlations in frequency. Mel filter-bank (FB) features, which exhibit this locality property, are used instead.<br />
<br />
The following features are used to build the table below; WER decides the best feature set.<br />
# Vocal Tract Length Normalization (VTLN) warping, to help map features into a canonical space.<br />
# Feature-space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
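For reference, the delta and double-delta features can be sketched as simple frame differences. Real front ends usually use a regression formula over a window of about ±2 frames; the two-point central difference here is a deliberate simplification.

```python
def deltas(frames):
    # frames: per-frame feature values for one filter-bank channel
    n = len(frames)
    def diff(seq):
        # central difference with edge replication at the boundaries
        return [(seq[min(t + 1, n - 1)] - seq[max(t - 1, 0)]) / 2.0
                for t in range(n)]
    d = diff(frames)   # delta: frame-to-frame slope
    dd = diff(d)       # double delta: slope of the delta track
    return d, dd
```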
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only in the frequency domain, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is evaluated on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
<br />
= Results with Proposed Architecture =<br />
<br />
The optimal architecture described in the previous section is used in the experiments. A 50-hour English Broadcast News (BN) dataset is used for training, and the EARS dev04f and rt04 datasets are used for testing. Five systems are compared, as shown in the following table. "Hybrid" means that the DNN or CNN is used to produce the likelihood probabilities for the HMM, while "CNN/DNN-based features" means that the CNN or DNN is used to produce features that are then fed to a GMM/HMM system. The hybrid CNN offers a 15% relative improvement over the GMM-HMM system and a 3-5% relative improvement over the hybrid DNN; CNN-based features offer a 5-6% relative improvement over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER for NN Hybrid and Feature-Based Systems.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 18.8<br />
| 18.1<br />
|-<br />
| Hybrid DNN<br />
| 16.3<br />
| 15.8<br />
|-<br />
| DNN-based features<br />
| 16.7<br />
| 16.0<br />
|-<br />
| Hybrid CNN<br />
| 15.8<br />
| 15.0<br />
|-<br />
| CNN-based features<br />
| 15.2<br />
| 15.0<br />
|-<br />
|}<br />
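The relative improvements quoted above follow the usual definition, (baseline − new) / baseline; for example, hybrid DNN to hybrid CNN on rt04 goes from 15.8 to 15.0:

```python
def rel_improvement(baseline_wer, new_wer):
    # relative WER reduction, in percent
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# hybrid DNN -> hybrid CNN: about 5% relative on rt04, about 3% on dev04f
```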
<br />
= Results on Large Tasks =<br />
<br />
After tuning the CNN configuration on a small dataset, the CNN-based features system is tested on two larger datasets.<br />
<br />
== Broadcast News ==<br />
Broadcast News consists of 400 hours of speech data, which is used for training. The DARPA EARS rt04 and dev04f datasets are used for testing. The following table shows that CNN-based features offer a 13-18% relative improvement over the GMM/HMM system and 10-12% over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER on Broadcast News, 400 hrs.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 16.0<br />
| 13.8<br />
|-<br />
| Hybrid DNN<br />
| 15.1<br />
| 13.4<br />
|-<br />
| DNN-based features<br />
| 14.9<br />
| 13.4<br />
|-<br />
| CNN-based features<br />
| 13.1<br />
| 12.0<br />
|-<br />
|}<br />
<br />
== Switchboard ==<br />
<br />
The Switchboard dataset consists of 300 hours of conversational American English telephony data. The Hub5'00 set is used for validation, while the rt03 set is used for testing; results are reported separately on its Switchboard (SWB) and Fisher (FSH) portions. Three systems are compared, as shown in the following table. CNN-based features offer a 13-33% relative improvement over the GMM/HMM system and a 4-7% relative improvement over the hybrid DNN system. These results show that CNNs are superior to both GMMs and DNNs.<br />
{| class="wikitable"<br />
|+ WER on Switchboard, 300 hrs.<br />
! Model<br />
! Hub5’00 SWB<br />
! rt03 FSH<br />
! rt03 SWB<br />
|-<br />
| Baseline GMM/HMM <br />
| 14.5<br />
| 17.0<br />
| 25.2<br />
|-<br />
| Hybrid DNN<br />
| 12.2<br />
| 14.9<br />
| 23.5<br />
|-<br />
| CNN-based features<br />
| 11.5<br />
| 14.3<br />
| 21.9<br />
|-<br />
|}<br />
<br />
= Conclusions and Discussions =<br />
<br />
In this work, the use of CNNs was explored, and they were shown to be superior to both GMMs and DNNs on a small speech recognition task. CNNs were then used to produce features for GMMs; this system was tested on larger datasets and outperformed both the GMM- and DNN-based systems.<br />
<br />
The authors set up the experiments without clarifying the following:<br />
# The hybrid CNN wasn't tested on the larger datasets; the authors give no reason for this, and it might be due to scalability issues.<br />
# They didn't compare to the CNN system proposed by Abdel-Hamid et al. <ref name=convDNN></ref>.<br />
<br />
= References =<br />
<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=27099deep Convolutional Neural Networks For LVCSR2015-12-06T00:46:32Z<p>X435liu: /* Optimal Feature Set */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They have outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems on both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal/spatial variations while reducing translation variance. CNNs are attractive for speech recognition for two reasons: first, they are translation invariant, which makes them an alternative to various speaker adaptation techniques; second, spectral representations of speech have strong local correlations, which CNNs can naturally capture.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, for which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to study the behaviour of CNNs for speech tasks, with results reported on the EARS dev04f dataset. 40-dimensional log mel-filter bank coefficients are used as features. The size of each hidden fully connected layer is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration in which the objective function does not improve sufficiently on a held-out validation set; training stops after the learning rate has been halved 5 times. <br />
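The learning-rate rule amounts to a "newbob"-style annealing schedule; here is a minimal sketch (the initial rate and the 0.001 improvement threshold are illustrative assumptions, not values from the paper):

```python
def lr_schedule(val_losses, lr0=0.01, min_impr=0.001, max_halvings=5):
    # val_losses: held-out objective after each training iteration;
    # returns the learning rate in effect at each iteration
    lr, halvings, rates = lr0, 0, []
    prev = float('inf')
    for loss in val_losses:
        rates.append(lr)
        if prev - loss < min_impr:   # insufficient improvement: halve the rate
            lr *= 0.5
            halvings += 1
            if halvings >= max_halvings:
                break                # stop training after the 5th halving
        prev = loss
    return rates
```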
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are typically used before the fully connected layers. The convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what had been explored before for speech recognition <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequency bands have different characteristics; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed weight sharing across nearby frequencies only. Although this addresses the problem, it limits the ability to stack multiple convolutional layers. In this work, weight sharing is done across the entire feature space, while using more filters than is typical in vision to capture the differences between the low and high frequency regions.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the configuration with 2 convolutional and 4 fully-connected layers. The total number of network parameters is kept constant for a fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
A slight further improvement is obtained by using 128 hidden units in the first convolutional layer and 256 in the second: many hidden units are needed in the convolutional layers to capture the differences between frequency regions in speech.<br />
<br />
== Optimal Feature Set ==<br />
Note that Linear Discriminant Analysis (LDA) features cannot be used with CNNs because LDA removes local correlations in frequency. Mel filter-bank (FB) features, which exhibit this locality property, are used instead.<br />
<br />
The following features are used to build the table below; WER decides the best feature set.<br />
# Vocal Tract Length Normalization (VTLN) warping, to help map features into a canonical space.<br />
# Feature-space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
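For reference, the delta and double-delta features can be sketched as simple frame differences. Real front ends usually use a regression formula over a window of about ±2 frames; the two-point central difference here is a deliberate simplification.

```python
def deltas(frames):
    # frames: per-frame feature values for one filter-bank channel
    n = len(frames)
    def diff(seq):
        # central difference with edge replication at the boundaries
        return [(seq[min(t + 1, n - 1)] - seq[max(t - 1, 0)]) / 2.0
                for t in range(n)]
    d = diff(frames)   # delta: frame-to-frame slope
    dd = diff(d)       # double delta: slope of the delta track
    return d, dd
```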
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only in the frequency domain, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is evaluated on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
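Frequency-only pooling can be sketched as a max over non-overlapping groups of frequency bands while the time axis is left untouched (an illustrative NumPy sketch, not the authors' implementation):<br />

```python
import numpy as np

def max_pool_frequency(feature_maps, pool_size=3):
    """Max-pool along the frequency axis only.

    feature_maps: (time, frequency) array; the time axis is left intact,
    since pooling in time was not found to help for speech.
    Frequency bins that do not fill a complete pool are dropped.
    """
    t, f = feature_maps.shape
    f_trim = (f // pool_size) * pool_size
    grouped = feature_maps[:, :f_trim].reshape(t, f_trim // pool_size, pool_size)
    return grouped.max(axis=2)

x = np.arange(12, dtype=float).reshape(2, 6)   # 2 time frames, 6 frequency bands
print(max_pool_frequency(x, pool_size=3))      # frequencies collapse 3-to-1
```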
<br />
= Results with Proposed Architecture =<br />
<br />
The architecture described in the previous section is used in the experiments. A 50-hour English Broadcast News (BN) dataset is used for training, and the EARS dev04f and rt04 datasets are used for testing. Five systems are compared in the following table. In the hybrid approach, the DNN or CNN directly produces the likelihood probabilities for the HMM, whereas in the CNN/DNN-based feature approach, the network produces features that are then used by a GMM/HMM system. The hybrid CNN offers a 15% relative improvement over the GMM/HMM system and a 3-5% relative improvement over the hybrid DNN. CNN-based features offer a 5-6% relative improvement over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER for NN Hybrid and Feature-Based Systems.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 18.8<br />
| 18.1<br />
|-<br />
| Hybrid DNN<br />
| 16.3<br />
| 15.8<br />
|-<br />
| DNN-based features<br />
| 16.7<br />
| 16.0<br />
|-<br />
| Hybrid CNN<br />
| 15.8<br />
| 15.0<br />
|-<br />
| CNN-based features<br />
| 15.2<br />
| 15.0<br />
|-<br />
|}<br />
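The relative improvements quoted throughout this summary follow the usual definition, the WER reduction divided by the baseline WER; for example, the hybrid CNN's gain over the GMM/HMM baseline on dev04f works out to about 16% relative, consistent with the roughly 15% quoted above:<br />

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER improvement, in percent, of new_wer over baseline_wer."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Hybrid CNN vs. the GMM/HMM baseline on dev04f (values from the table above)
print(round(relative_improvement(18.8, 15.8), 1))   # 16.0
```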
<br />
= Results on Large Tasks =<br />
<br />
After tuning the CNN configuration on a small dataset, the CNN-based features system is tested on two larger datasets.<br />
<br />
== Broadcast News ==<br />
Broadcast News consists of 400 hours of speech data and was used for training. The DARPA EARS rt04 and dev04f datasets were used for testing. The following table shows that CNN-based features offer a 13-18% relative improvement over the GMM/HMM system and 10-12% over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER on Broadcast News, 400 hrs.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 16.0<br />
| 13.8<br />
|-<br />
| Hybrid DNN<br />
| 15.1<br />
| 13.4<br />
|-<br />
| DNN-based features<br />
| 14.9<br />
| 13.4<br />
|-<br />
| CNN-based features<br />
| 13.1<br />
| 12.0<br />
|-<br />
|}<br />
<br />
== Switchboard ==<br />
<br />
The Switchboard dataset consists of 300 hours of conversational American English telephony data. The Hub5'00 dataset is used as the validation set, while the rt03 set is used for testing. Switchboard (SWB) and Fisher (FSH) are portions of the set, and results are reported separately for each. Three systems, shown in the following table, are compared. CNN-based features offer a 13-33% relative improvement over the GMM/HMM system and a 4-7% relative improvement over the hybrid DNN system. These results show that CNNs are superior to both GMMs and DNNs.<br />
{| class="wikitable"<br />
|+ WER on Switchboard, 300 hrs.<br />
! Model<br />
! Hub5’00 SWB<br />
! rt03 FSH<br />
! rt03 SWB<br />
|-<br />
| Baseline GMM/HMM <br />
| 14.5<br />
| 17.0<br />
| 25.2<br />
|-<br />
| Hybrid DNN<br />
| 12.2<br />
| 14.9<br />
| 23.5<br />
|-<br />
| CNN-based features<br />
| 11.5<br />
| 14.3<br />
| 21.9<br />
|-<br />
|}<br />
<br />
= Conclusions and Discussions =<br />
<br />
In this work, CNNs were explored for speech recognition and shown to be superior to both GMMs and DNNs on a small task. CNNs were then used to produce features for GMMs; tested on larger datasets, this system outperformed both the GMM- and DNN-based systems.<br />
<br />
The authors set up the experiments without clarifying the following:<br />
# The hybrid CNN wasn't tested on the larger datasets; the authors didn't give a reason, and it might be due to scalability issues.<br />
# They didn't compare to the CNN system proposed by Abdel-Hamid et al. <ref name=convDNN></ref>.<br />
<br />
= References =<br />
<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=very_Deep_Convoloutional_Networks_for_Large-Scale_Image_Recognition&diff=27097very Deep Convoloutional Networks for Large-Scale Image Recognition2015-12-05T23:48:24Z<p>X435liu: </p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, the effect of convolutional network depth on accuracy in the large-scale image recognition setting is investigated. It is demonstrated that representation depth is beneficial for classification accuracy, and the main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters. The authors fix the other parameters of the architecture and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3×3) convolution filters in all layers. As a result, they arrive at significantly more accurate ConvNet architectures.<br />
<br />
= Conv.Net Configurations =<br />
<br />
Architecture:<br />
<br />
During training, the image is passed through a stack of convolutional (conv.) layers with filters that have a very small receptive field: 3×3. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers. Max-pooling is performed over a 2×2 pixel window with stride 2. The stack of convolutional layers (whose depth differs between architectures) is followed by three fully-connected (FC) layers. The final layer is the soft-max layer, and all hidden layers are equipped with the rectification non-linearity.<br />
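Because the 3×3 convolutions use stride 1 with padding that preserves spatial size, only the five max-pooling layers change the resolution; the feature-map side length can be traced in a few lines (a sketch assuming the standard 224×224 input):<br />

```python
def trace_spatial_size(input_size=224, num_pools=5):
    """Track the feature-map side length through the VGG-style stack.

    3x3 convolutions with stride 1 and padding 1 keep the size fixed,
    so only the 2x2 stride-2 max-pooling layers change it.
    """
    sizes = [input_size]
    for _ in range(num_pools):
        sizes.append(sizes[-1] // 2)   # 2x2 max-pool with stride 2 halves the side
    return sizes

print(trace_spatial_size())   # [224, 112, 56, 28, 14, 7]
```

The final 7×7 map is what later allows the first fully-connected layer to be recast as a 7×7 convolution at test time.<br />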
<br />
They don't implement Local Response Normalization (LRN) as they found such normalization does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.<br />
<br />
Configuration:<br />
<br />
The ConvNet configurations, evaluated in this paper, are outlined in the following table:<br />
<br />
<br />
[[File:4.PNG | center]]<br />
<br />
<br />
All configurations follow the aforementioned architecture and differ only in the depth from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers) (the added layers are shown in bold). Besides, the width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.<br />
<br />
As stated in the table, multiple convolutional layers with small filters are used without any max-pooling layer between them. It is easy to show that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5, and a stack of three has a receptive field of 7×7. Using a stack of two or three small conv. layers instead of a single large one has two main advantages:<br />
1) Two or three non-linear rectification layers are incorporated instead of a single one, which makes the decision function more discriminative.<br />
2) The number of parameters is decreased.<br />
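The parameter saving can be checked directly: for C input and output channels, three stacked 3×3 conv. layers use 3·(3²C²) = 27C² weights, whereas a single 7×7 layer with the same effective receptive field uses 49C² (biases ignored in this sketch):<br />

```python
def conv_weights(kernel, channels, num_layers=1):
    """Weight count for num_layers stacked kernel x kernel convolutions
    with `channels` input and output channels (biases ignored)."""
    return num_layers * kernel * kernel * channels * channels

C = 256
stacked = conv_weights(3, C, num_layers=3)   # three 3x3 layers: 27 * C^2
single = conv_weights(7, C)                  # one 7x7 layer:    49 * C^2
print(stacked, single, round(stacked / single, 2))   # 27/49 ~ 0.55
```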
<br />
Meanwhile, since a 1×1 convolution is essentially a linear projection onto a space of the same dimensionality, the incorporation of 1×1 conv. layers (configuration C) is a way to increase the non-linearity of the decision function, through the accompanying rectification function, without affecting the receptive fields of the conv. layers.<br />
<br />
= Classification Framework =<br />
<br />
In this section, the details of classification ConvNet training and evaluation are described.<br />
<br />
Training:<br />
<br />
Training is carried out by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum. Initial weights for some layers were obtained from configuration “A”, which is shallow enough to be trained with random initialization; the intermediate layers of the deeper models were initialized randomly.<br />
In spite of the larger number of parameters and the greater depth of the introduced nets, these nets required fewer epochs to converge due to the following reasons:<br />
(a) implicit regularisation imposed by greater depth and smaller conv. filter sizes;<br />
(b) pre-initialisation of certain layers.<br />
<br />
During training, the input to the ConvNets is a fixed-size 224 × 224 RGB image, obtained by rescaling the training image and taking one crop per image per SGD iteration. To rescale the input image, a training scale S, defined as the smallest side of the isotropically-rescaled training image, must be set. Two approaches for setting S are considered:<br />
1) single-scale training, which uses a fixed S;<br />
2) multi-scale training, where each training image is individually rescaled by randomly sampling S from a range [Smin, Smax].<br />
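Multi-scale training can be sketched as sampling S per image, isotropically rescaling so the smaller side equals S, and taking a random 224×224 crop (illustrative only; the helper name and return values are not from the paper):<br />

```python
import random

def sample_training_crop(width, height, s_min=256, s_max=512, crop=224):
    """Pick a training scale S, rescale isotropically, choose a random crop.

    Returns the rescaled image size and the top-left corner of a
    crop x crop patch inside it.
    """
    s = random.randint(s_min, s_max)        # multi-scale: S ~ U[s_min, s_max]
    scale = s / min(width, height)          # smallest side becomes S
    new_w, new_h = round(width * scale), round(height * scale)
    x = random.randint(0, new_w - crop)     # random horizontal offset
    y = random.randint(0, new_h - crop)     # random vertical offset
    return (new_w, new_h), (x, y)

random.seed(0)
print(sample_training_crop(640, 480))
```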
<br />
It took 2–3 weeks to train a single net by using four NVIDIA Titan Black GPUs.<br />
<br />
Testing:<br />
<br />
At test time, the input image is classified as follows:<br />
First, it is isotropically rescaled to a pre-defined smallest image side, denoted Q.<br />
Then, the network is applied densely over the rescaled test image: the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).<br />
The resulting fully-convolutional net is then applied to the whole (uncropped) image.<br />
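The FC-to-conv conversion is just a reinterpretation of the weight matrix: the first FC layer's 4096×(512·7·7) weights become 4096 filters of shape 512×7×7, so the same parameters can slide over larger images (a NumPy sketch of the weight reshape, assuming the standard VGG dimensions):<br />

```python
import numpy as np

# First fully-connected layer of the classifier: maps the flattened
# 512 x 7 x 7 final feature map to 4096 hidden units.
fc_weights = np.random.randn(4096, 512 * 7 * 7)

# Reinterpreted as a 7x7 convolution with 512 input and 4096 output channels.
conv_kernel = fc_weights.reshape(4096, 512, 7, 7)

# Applied to a 7x7 input this convolution yields exactly one spatial position,
# matching the original FC layer; on a larger rescaled image it instead yields
# a spatial map of class-score vectors.
print(conv_kernel.shape)   # (4096, 512, 7, 7)
```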
<br />
= Classification Experiments =<br />
In this section, the image classification results on the ILSVRC-2012 dataset are described:<br />
<br />
== Single-Scale Evaluation ==<br />
<br />
In the first part of the experiment, the test image size was set as Q = S for fixed S, and Q = 0.5(Smin + Smax) for jittered S. One important result of this evaluation was that the classification error decreases with increased ConvNet depth.<br />
Moreover, the worse performance of the configuration with 1×1 filters (C) in comparison with the one with 3×3 filters (D) indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).<br />
Finally, scale jittering at training time leads to significantly better results than training on images with fixed smallest side. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.<br />
<br />
[[File:ConvNet1.PNG | center]]<br />
<br />
== Multi-Scale Evaluation ==<br />
<br />
In addition to single scale evaluation stated in the previous section, in this paper, the effect of scale jittering at test time is assessed by running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. The results indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale).<br />
<br />
Their best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error. On the test set, the configuration E achieves 7.3% top-5 error.<br />
<br />
[[File:ConvNet2.PNG | center]]<br />
<br />
== Comparison With The State Of The Art ==<br />
<br />
Their very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions.<br />
<br />
[[File:ConvNet3.PNG | center]]<br />
<br />
= Appendix A: Localization =<br />
<br />
In addition to classification, the introduced architectures have been used for localization. To perform object localisation, a very deep ConvNet is used in which the last fully-connected layer predicts the bounding box location instead of the class scores. Apart from this last bounding-box prediction layer, the network uses the ConvNet architecture D, which was found to be the best-performing in the classification task, and training of the localisation ConvNets is similar to that of the classification ConvNets. The main difference is that the logistic regression objective is replaced with a Euclidean loss, which penalises the deviation of the predicted bounding-box parameters from the ground truth.<br />
Two testing protocols are considered:<br />
The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class. (The bounding box is obtained by applying the network only to the central crop of the image.)<br />
The second, fully-fledged, testing procedure is based on the dense application of the localization ConvNet to the whole image, similarly to the classification task.<br />
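The Euclidean loss simply penalises the squared deviation of the predicted bounding-box parameters from the ground truth (a minimal sketch; the (x, y, w, h) parameterisation shown is illustrative):<br />

```python
import numpy as np

def euclidean_loss(pred_boxes, true_boxes):
    """Half the mean squared deviation between predicted and ground-truth
    bounding-box parameters, e.g. (x_center, y_center, width, height)."""
    diff = np.asarray(pred_boxes) - np.asarray(true_boxes)
    return 0.5 * np.mean(np.sum(diff ** 2, axis=-1))

pred = np.array([[50.0, 60.0, 100.0, 80.0]])
true = np.array([[52.0, 58.0, 104.0, 80.0]])
print(euclidean_loss(pred, true))   # 0.5 * (4 + 4 + 16 + 0) = 12.0
```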
<br />
The localization experiments indicate that the introduced very deep ConvNets produce considerably better results than the previous state of the art despite using a simpler localization method, thanks to their more powerful representation.<br />
<br />
= Conclusion =<br />
<br />
Very deep ConvNets are introduced in this paper. The results show that the configuration has good performance on classification and localization and significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. They also showed that their configuration is applicable to some other datasets.</div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=very_Deep_Convoloutional_Networks_for_Large-Scale_Image_Recognition&diff=27096very Deep Convoloutional Networks for Large-Scale Image Recognition2015-12-05T23:47:08Z<p>X435liu: /* Classification Experiments */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, the effect of convolutional network depth on accuracy in the large-scale image recognition setting is investigated. It is demonstrated that representation depth is beneficial for classification accuracy, and the main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters. The authors fix the other parameters of the architecture and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3×3) convolution filters in all layers. As a result, they arrive at significantly more accurate ConvNet architectures.<br />
<br />
= Conv.Net Configurations =<br />
<br />
Architecture:<br />
<br />
During training, the image is passed through a stack of convolutional (conv.) layers with filters that have a very small receptive field: 3×3. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers. Max-pooling is performed over a 2×2 pixel window with stride 2. The stack of convolutional layers (whose depth differs between architectures) is followed by three fully-connected (FC) layers. The final layer is the soft-max layer, and all hidden layers are equipped with the rectification non-linearity.<br />
<br />
They don't implement Local Response Normalization (LRN) as they found such normalization does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.<br />
<br />
Configuration:<br />
<br />
The ConvNet configurations, evaluated in this paper, are outlined in the following table:<br />
<br />
<br />
[[File:4.PNG | center]]<br />
<br />
<br />
All configurations follow the aforementioned architecture and differ only in the depth from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers) (the added layers are shown in bold). Besides, the width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.<br />
<br />
As stated in the table, multiple convolutional layers with small filters are used without any max-pooling layer between them. It is easy to show that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5, and a stack of three has a receptive field of 7×7. Using a stack of two or three small conv. layers instead of a single large one has two main advantages:<br />
1) Two or three non-linear rectification layers are incorporated instead of a single one, which makes the decision function more discriminative.<br />
2) The number of parameters is decreased.<br />
<br />
Meanwhile, since a 1×1 convolution is essentially a linear projection onto a space of the same dimensionality, the incorporation of 1×1 conv. layers (configuration C) is a way to increase the non-linearity of the decision function, through the accompanying rectification function, without affecting the receptive fields of the conv. layers.<br />
<br />
== Classification Framework==<br />
<br />
In this section, the details of classification ConvNet training and evaluation are described.<br />
<br />
Training:<br />
<br />
Training is carried out by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum. Initial weights for some layers were obtained from configuration “A”, which is shallow enough to be trained with random initialization; the intermediate layers of the deeper models were initialized randomly.<br />
In spite of the larger number of parameters and the greater depth of the introduced nets, these nets required fewer epochs to converge due to the following reasons:<br />
(a) implicit regularisation imposed by greater depth and smaller conv. filter sizes;<br />
(b) pre-initialisation of certain layers.<br />
<br />
During training, the input to the ConvNets is a fixed-size 224 × 224 RGB image, obtained by rescaling the training image and taking one crop per image per SGD iteration. To rescale the input image, a training scale S, defined as the smallest side of the isotropically-rescaled training image, must be set. Two approaches for setting S are considered:<br />
1) single-scale training, which uses a fixed S;<br />
2) multi-scale training, where each training image is individually rescaled by randomly sampling S from a range [Smin, Smax].<br />
<br />
It took 2–3 weeks to train a single net by using four NVIDIA Titan Black GPUs.<br />
<br />
Testing:<br />
<br />
At test time, the input image is classified as follows:<br />
First, it is isotropically rescaled to a pre-defined smallest image side, denoted Q.<br />
Then, the network is applied densely over the rescaled test image: the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).<br />
The resulting fully-convolutional net is then applied to the whole (uncropped) image.<br />
<br />
==Classification Experiments==<br />
In this section, the image classification results on the ILSVRC-2012 dataset are described:<br />
<br />
=== Single-Scale Evaluation ===<br />
<br />
In the first part of the experiment, the test image size was set as Q = S for fixed S, and Q = 0.5(Smin + Smax) for jittered S. One important result of this evaluation was that the classification error decreases with increased ConvNet depth.<br />
Moreover, the worse performance of the configuration with 1×1 filters (C) in comparison with the one with 3×3 filters (D) indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).<br />
Finally, scale jittering at training time leads to significantly better results than training on images with fixed smallest side. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.<br />
<br />
[[File:ConvNet1.PNG | center]]<br />
<br />
=== Multi-Scale Evaluation ===<br />
<br />
In addition to single scale evaluation stated in the previous section, in this paper, the effect of scale jittering at test time is assessed by running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. The results indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale).<br />
<br />
Their best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error. On the test set, configuration E achieves 7.3% top-5 error.<br />
<br />
[[File:ConvNet2.PNG | center]]<br />
<br />
=== Comparison With The State Of The Art ===<br />
<br />
Their very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions.<br />
<br />
[[File:ConvNet3.PNG | center]]<br />
<br />
== Appendix A: Localization==<br />
<br />
In addition to classification, the introduced architectures have been used for localization. To perform object localisation, a very deep ConvNet is used in which the last fully-connected layer predicts the bounding box location instead of the class scores. Apart from this bounding box prediction layer, the ConvNet architecture D, which was found to be the best-performing in the classification task, is used, and the training of the localisation ConvNets is similar to that of the classification ConvNets. The main difference is that the logistic regression objective is replaced with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground truth.<br />
Two testing protocols are considered:<br />
The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class. (The bounding box is obtained by applying the network only to the central crop of the image.)<br />
The second, fully-fledged, testing procedure is based on the dense application of the localization ConvNet to the whole image, similarly to the classification task.<br />
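The Euclidean loss used for localisation penalises the squared deviation of each predicted bounding-box parameter from its ground-truth value; a minimal sketch (the [cx, cy, w, h] parameterisation is our assumption for illustration):

```python
def euclidean_loss(pred_box, gt_box):
    """Sum of squared differences between predicted and ground-truth
    bounding box parameters (e.g. [cx, cy, w, h])."""
    return sum((p - g) ** 2 for p, g in zip(pred_box, gt_box))

print(euclidean_loss([10.0, 10.0, 50.0, 50.0],
                     [12.0, 10.0, 48.0, 52.0]))  # 12.0
```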
<br />
The localization experiments indicate that the introduced very deep ConvNets achieve considerably better results than previous work despite using a simpler localization method, owing to their more powerful representation.<br />
<br />
= Conclusion =<br />
<br />
Very deep ConvNets are introduced in this paper. The results show that the configurations perform well on classification and localization and significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. The authors also showed that their configurations are applicable to other datasets as well.</div>
<div>= Introduction =<br />
<br />
In this paper, the effect of convolutional network depth on accuracy in the large-scale image recognition setting is investigated. It is demonstrated that representation depth is beneficial for classification accuracy, and the main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters. Basically, the authors fix the other parameters of the architecture and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers. As a result, they come up with significantly more accurate ConvNet architectures.<br />
<br />
= Conv.Net Configurations =<br />
<br />
Architecture:<br />
<br />
During training, the image is passed through a stack of convolutional (conv.) layers with filters with a very small receptive field: 3 × 3. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers. Max-pooling is performed over a 2 × 2 pixel window, with stride 2. The stack of convolutional layers (whose depth differs between architectures) is followed by three Fully-Connected (FC) layers. The final layer is the soft-max layer, and all hidden layers are equipped with the rectification non-linearity.<br />
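As a quick sanity check on this architecture (our own arithmetic, assuming the 3 × 3 convolutions are padded so that they preserve spatial resolution, and a 224 × 224 input as used for training), the feature-map side is halved by each of the five max-pooling layers:

```python
def feature_map_sides(input_side=224, n_pools=5):
    """Spatial side after each 2x2/stride-2 max-pooling; the padded
    3x3 convolutions between poolings keep the side unchanged."""
    sides = [input_side]
    for _ in range(n_pools):
        sides.append(sides[-1] // 2)
    return sides

print(feature_map_sides())  # [224, 112, 56, 28, 14, 7]
```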
<br />
They do not use Local Response Normalization (LRN), as they found such normalization does not improve performance on the ILSVRC dataset but leads to increased memory consumption and computation time.<br />
<br />
Configuration:<br />
<br />
The ConvNet configurations, evaluated in this paper, are outlined in the following table:<br />
<br />
<br />
[[File:4.PNG | center]]<br />
<br />
<br />
All configurations follow the aforementioned architecture and differ only in the depth from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers) (the added layers are shown in bold). Besides, the width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.<br />
<br />
As shown in the table, multiple convolutional layers with small filters are used without any max-pooling layer between them. It is easy to show that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5, and using a stack of two or three small conv. layers in place of a single larger one has two main advantages:<br />
1) Two or three non-linear rectification layers are incorporated instead of a single one, which makes the decision function more discriminative.<br />
2) The number of parameters is decreased.<br />
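Both advantages can be checked with a quick back-of-the-envelope calculation. The sketch below is plain Python (not code from the paper); C = 64 channels is an illustrative value.<br />

```python
def stack_receptive_field(kernel_sizes):
    """Effective receptive field of a stack of stride-1 conv layers."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def stack_params(kernel_sizes, channels):
    """Weight count of a conv stack with `channels` in/out channels per layer."""
    return sum(k * k * channels * channels for k in kernel_sizes)

# Two stacked 3x3 layers see as far as a single 5x5 layer,
assert stack_receptive_field([3, 3]) == stack_receptive_field([5]) == 5
# and three stacked 3x3 layers as far as a single 7x7 layer,
assert stack_receptive_field([3, 3, 3]) == stack_receptive_field([7]) == 7
# yet with fewer weights: 3 * (3^2 C^2) = 27 C^2 vs 7^2 C^2 = 49 C^2 for C = 64.
assert stack_params([3, 3, 3], 64) == 27 * 64 * 64   # 110592
assert stack_params([7], 64) == 49 * 64 * 64         # 200704
```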
<br />
Meanwhile, since the 1×1 convolution is essentially a linear projection onto a space of the same dimensionality, the incorporation of 1 × 1 conv. layers (configuration C) is a way to increase the non-linearity of the decision function, through the rectification function that follows each of them, without affecting the receptive fields of the conv. layers.<br />
<br />
== Classification Framework==<br />
<br />
In this section, the details of classification ConvNet training and evaluation are described.<br />
<br />
Training:<br />
<br />
Training is carried out by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum. Initial weights for some layers were obtained from configuration “A” which is shallow enough to be trained with random initialization. The intermediate layers in deep models were initialized randomly.<br />
In spite of the larger number of parameters and the greater depth of the introduced nets, these nets required fewer epochs to converge, for two reasons:<br />
(a) implicit regularisation imposed by greater depth and smaller conv. filter sizes.<br />
(b) using pre-initialisation of certain layers.<br />
<br />
During training, the input to the ConvNets is a fixed-size 224 × 224 RGB image. To obtain this fixed-size image, rescaling has been done while training (one crop per image per SGD iteration). In order to rescale the input image, a training scale, from which the ConvNet input is cropped, should be determined.<br />
Two approaches for setting the training scale S (where S denotes the smallest side of an isotropically-rescaled training image) are considered:<br />
1) single-scale training, which requires a fixed S;<br />
2) multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax].<br />
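The two regimes can be sketched as follows (plain Python, not the authors' code; the image size is illustrative, and [Smin, Smax] = [256, 512] is the range used in the paper's multi-scale experiments).<br />

```python
import random

def isotropic_rescale(image_hw, S):
    """Rescale so that the smallest image side equals the training scale S."""
    h, w = image_hw
    scale = S / min(h, w)
    return round(h * scale), round(w * scale)

# Single-scale training: a fixed S.
h, w = isotropic_rescale((480, 640), 256)
assert (h, w) == (256, 341)

# Multi-scale training: sample S uniformly from [Smin, Smax] for each image;
# a single 224 x 224 crop of the rescaled image then feeds the network.
S_min, S_max = 256, 512
S = random.randint(S_min, S_max)
h, w = isotropic_rescale((480, 640), S)
assert min(h, w) == S and S >= 224  # a 224 x 224 crop always fits
```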
<br />
It took 2–3 weeks to train a single net by using four NVIDIA Titan Black GPUs.<br />
<br />
Testing:<br />
<br />
At test time, in order to classify the input image:<br />
First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q. <br />
Then, the network is applied densely over the rescaled test image: the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).<br />
The resulting fully-convolutional net is then applied to the whole (uncropped) image.<br />
<br />
==Classification Experiments==<br />
In this section, the image classification results on the ILSVRC-2012 dataset are described:<br />
<br />
Single-Scale Evaluation:<br />
<br />
In the first part of the experiment, the test image size was set to Q = S for fixed S, and to Q = 0.5(Smin + Smax) for jittered S. One important result of this evaluation was that the classification error decreases with increased ConvNet depth.<br />
Moreover, the worse performance of the configuration with 1×1 filters (C) in comparison with the one with 3×3 filters (D) indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).<br />
Finally, scale jittering at training time leads to significantly better results than training on images with fixed smallest side. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.<br />
<br />
<br />
Multi-Scale Evaluation<br />
<br />
In addition to single scale evaluation stated in the previous section, in this paper, the effect of scale jittering at test time is assessed by running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. The results indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale).<br />
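This test-time averaging can be sketched in a few lines (plain Python; the 3-class posteriors are hypothetical stand-ins for softmax outputs at three scales Q).<br />

```python
def average_posteriors(posteriors_per_scale):
    """Average the class posteriors obtained at several test scales Q."""
    n = len(posteriors_per_scale)
    num_classes = len(posteriors_per_scale[0])
    return [sum(p[c] for p in posteriors_per_scale) / n for c in range(num_classes)]

# Hypothetical 3-class posteriors from three rescaled versions of one test image.
p_q1 = [0.7, 0.2, 0.1]
p_q2 = [0.6, 0.3, 0.1]
p_q3 = [0.8, 0.1, 0.1]
avg = average_posteriors([p_q1, p_q2, p_q3])

assert abs(sum(avg) - 1.0) < 1e-9   # averaging valid distributions stays valid
assert avg.index(max(avg)) == 0     # the final prediction is class 0
```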
<br />
== Appendix A: Localization==<br />
<br />
In addition to classification, the introduced architectures have been used for localization. To perform object localisation, a very deep ConvNet is used in which the last fully-connected layer predicts the bounding box location instead of the class scores. Apart from this last bounding-box prediction layer, the ConvNet architecture D, which was found to be the best-performing in the classification task, is used, and the training of localisation ConvNets is similar to that of the classification ConvNets. The main difference is that the logistic regression objective is replaced with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground truth.<br />
Two testing protocols are considered:<br />
The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class. (The bounding box is obtained by applying the network only to the central crop of the image.)<br />
The second, fully-fledged, testing procedure is based on the dense application of the localization ConvNet to the whole image, similarly to the classification task.<br />
<br />
The localization experiments indicate that the introduced very deep ConvNets achieve considerably better results despite a simpler localization method, owing to their more powerful representation.<br />
<br />
= Conclusion =<br />
<br />
Very deep ConvNets are introduced in this paper. The results show that the proposed configurations perform well on both classification and localization and significantly outperform the previous generation of models, which had achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. The authors also showed that their configurations are applicable to other datasets.</div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=very_Deep_Convoloutional_Networks_for_Large-Scale_Image_Recognition&diff=27088very Deep Convoloutional Networks for Large-Scale Image Recognition2015-12-05T23:25:41Z<p>X435liu: /* Conclusion */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, the effect of convolutional network depth on accuracy in the large-scale image recognition setting is investigated. It is demonstrated that representation depth is beneficial for classification accuracy, and the main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters. Basically, the authors fix the other parameters of the architecture and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers. As a result, they come up with significantly more accurate ConvNet architectures.<br />
<br />
= Conv.Net Configurations =<br />
<br />
Architecture:<br />
<br />
During training, the image is passed through a stack of convolutional (conv.) layers with filters that have a very small receptive field: 3 × 3. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers. Max-pooling is performed over a 2 × 2 pixel window, with stride 2. A stack of convolutional layers (whose depth differs between architectures) is followed by three Fully-Connected (FC) layers. The final layer is the soft-max layer, and all hidden layers are equipped with the rectification non-linearity.<br />
<br />
Configuration:<br />
<br />
The ConvNet configurations, evaluated in this paper, are outlined in the following table:<br />
<br />
<br />
[[File:4.PNG | center]]<br />
<br />
<br />
All configurations follow the aforementioned architecture and differ only in the depth from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers) (the added layers are shown in bold). Besides, the width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.<br />
<br />
As shown in the table, multiple convolutional layers with small filters are used without any max-pooling layer between them. It is easy to show that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5, and using a stack of two or three small conv. layers in place of a single larger one has two main advantages:<br />
1) Two or three non-linear rectification layers are incorporated instead of a single one, which makes the decision function more discriminative.<br />
2) The number of parameters is decreased.<br />
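Both advantages can be checked with a quick back-of-the-envelope calculation. The sketch below is plain Python (not code from the paper); C = 64 channels is an illustrative value.<br />

```python
def stack_receptive_field(kernel_sizes):
    """Effective receptive field of a stack of stride-1 conv layers."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def stack_params(kernel_sizes, channels):
    """Weight count of a conv stack with `channels` in/out channels per layer."""
    return sum(k * k * channels * channels for k in kernel_sizes)

# Two stacked 3x3 layers see as far as a single 5x5 layer,
assert stack_receptive_field([3, 3]) == stack_receptive_field([5]) == 5
# and three stacked 3x3 layers as far as a single 7x7 layer,
assert stack_receptive_field([3, 3, 3]) == stack_receptive_field([7]) == 7
# yet with fewer weights: 3 * (3^2 C^2) = 27 C^2 vs 7^2 C^2 = 49 C^2 for C = 64.
assert stack_params([3, 3, 3], 64) == 27 * 64 * 64   # 110592
assert stack_params([7], 64) == 49 * 64 * 64         # 200704
```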
<br />
Meanwhile, since the 1×1 convolution is essentially a linear projection onto a space of the same dimensionality, the incorporation of 1 × 1 conv. layers (configuration C) is a way to increase the non-linearity of the decision function, through the rectification function that follows each of them, without affecting the receptive fields of the conv. layers.<br />
<br />
== Classification Framework==<br />
<br />
In this section, the details of classification ConvNet training and evaluation are described.<br />
<br />
Training:<br />
<br />
Training is carried out by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum. Initial weights for some layers were obtained from configuration “A” which is shallow enough to be trained with random initialization. <br />
In spite of the larger number of parameters and the greater depth of the introduced nets, these nets required fewer epochs to converge, for two reasons:<br />
(a) implicit regularisation imposed by greater depth and smaller conv. filter sizes.<br />
(b) using pre-initialisation of certain layers.<br />
<br />
During training, the input to the ConvNets is a fixed-size 224 × 224 RGB image. To obtain this fixed-size image, rescaling has been done while training (one crop per image per SGD iteration). In order to rescale the input image, a training scale, from which the ConvNet input is cropped, should be determined.<br />
Two approaches for setting the training scale S (where S denotes the smallest side of an isotropically-rescaled training image) are considered:<br />
1) single-scale training, which requires a fixed S;<br />
2) multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax].<br />
<br />
Testing:<br />
<br />
At test time, in order to classify the input image:<br />
First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q. <br />
Then, the network is applied densely over the rescaled test image: the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).<br />
The resulting fully-convolutional net is then applied to the whole (uncropped) image.<br />
<br />
==Classification Experiments==<br />
In this section, the image classification results on the ILSVRC-2012 dataset are described:<br />
<br />
Single-Scale Evaluation:<br />
<br />
In the first part of the experiment, the test image size was set to Q = S for fixed S, and to Q = 0.5(Smin + Smax) for jittered S. One important result of this evaluation was that the classification error decreases with increased ConvNet depth.<br />
Moreover, the worse performance of the configuration with 1×1 filters (C) in comparison with the one with 3×3 filters (D) indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).<br />
Finally, scale jittering at training time leads to significantly better results than training on images with fixed smallest side. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.<br />
<br />
<br />
Multi-Scale Evaluation<br />
<br />
In addition to single scale evaluation stated in the previous section, in this paper, the effect of scale jittering at test time is assessed by running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. The results indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale).<br />
<br />
== Appendix A: Localization==<br />
<br />
In addition to classification, the introduced architectures have been used for localization. To perform object localisation, a very deep ConvNet is used in which the last fully-connected layer predicts the bounding box location instead of the class scores. Apart from this last bounding-box prediction layer, the ConvNet architecture D, which was found to be the best-performing in the classification task, is used, and the training of localisation ConvNets is similar to that of the classification ConvNets. The main difference is that the logistic regression objective is replaced with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground truth.<br />
Two testing protocols are considered:<br />
The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class. (The bounding box is obtained by applying the network only to the central crop of the image.)<br />
The second, fully-fledged, testing procedure is based on the dense application of the localization ConvNet to the whole image, similarly to the classification task.<br />
<br />
The localization experiments indicate that the introduced very deep ConvNets achieve considerably better results despite a simpler localization method, owing to their more powerful representation.<br />
<br />
= Conclusion =<br />
<br />
Very deep ConvNets are introduced in this paper. The results show that the proposed configurations perform well on both classification and localization and significantly outperform the previous generation of models, which had achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. The authors also showed that their configurations are applicable to other datasets.</div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Sparse_Rectifier_Neural_Networks&diff=27087deep Sparse Rectifier Neural Networks2015-12-05T22:07:57Z<p>X435liu: /* Results */</p>
<hr />
<div>= Introduction =<br />
<br />
Two trends in Deep Learning can be seen in terms of architecture improvements. The first is increasing sparsity (for example, see convolutional neural nets); the second is increasing biological plausibility (biologically plausible sigmoid neurons performing better than tanh neurons). Rectified linear neurons are good for both sparsity and biological plausibility, and thus should increase performance.<br />
<br />
In this paper the authors show that two common gaps between computational neuroscience models and machine learning neural network models can be bridged by the rectifier activation function. One is the gap between deep networks learnt with and without unsupervised pre-training; the other is between the activation function and sparsity in neural networks.<br />
<br />
== Biological Plausibility and Sparsity ==<br />
<br />
In the brain, neurons rarely fire at the same time, as a way to balance quality of representation against energy conservation. This is in stark contrast to sigmoid neurons, which fire at 1/2 of their maximum rate at zero input. A solution to this problem is a rectifier neuron, which does not fire at its zero value. The rectified linear unit is inspired by a common biological model of the neuron, the leaky integrate-and-fire model (LIF), described by Dayan and Abbott<ref><br />
Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems<br />
</ref>. Its function is illustrated in the figure below (middle).<br />
<br />
<gallery><br />
Image:sig_neuron.png|Sigmoid and TANH Neuron<br />
Image:lif_neuron.png|Leaky Integrate Fire Neuron<br />
Image:rect_neuron.png|Rectified Linear Neuron<br />
</gallery><br />
<br />
Given that the rectifier neuron has a larger range of inputs that are output as zero, its representation will clearly be sparser. In the paper, the two most salient advantages of sparsity are:<br />
<br />
- '''Information Disentangling''' As opposed to a dense representation, where every slight input change results in a considerable output change, the non-zero items of a sparse representation remain almost constant under slight input changes.<br />
<br />
- '''Variable Dimensionality''' A sparse representation can effectively choose how many dimensions to use to represent a variable, since it chooses how many non-zero elements to contribute. Thus, the precision is variable, allowing for more efficient representation of complex items.<br />
<br />
Further benefits of a sparse representation, and of rectified linear neurons in particular, are better linear separability and lower computational complexity (most units are off, and for the active units only a linear function has to be computed).<br />
<br />
However, it should also be noted that sparsity reduces the capacity of the model because each unit takes part in the representation of fewer values.<br />
<br />
== Advantages of rectified linear units ==<br />
<br />
The rectifier activation function max(0, x) allows a network to easily obtain sparse representations. For a given input, once the subset of active neurons is fixed, the output is a linear function of the input, which means gradients propagate well along the active paths of neurons and mathematical investigation is easier.<br />
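The sparsity this induces is easy to see on synthetic data (a sketch, assuming zero-centred Gaussian pre-activations; roughly half of them fall below zero and are clipped to exact zeros):<br />

```python
import random

def relu(x):
    """Rectifier activation: max(0, x)."""
    return max(0.0, x)

random.seed(0)
# Hypothetical zero-centred Gaussian pre-activations for 1000 units.
pre = [random.gauss(0.0, 1.0) for _ in range(1000)]
post = [relu(x) for x in pre]

# About half the units are clipped to exact zeros, giving a genuinely sparse code.
sparsity = sum(1 for a in post if a == 0.0) / len(post)
assert 0.4 < sparsity < 0.6
# On the active units the map is purely linear: non-zero outputs equal their inputs.
assert all(a == x for a, x in zip(post, pre) if a > 0.0)
```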
<br />
== Potential problems of rectified linear units ==<br />
<br />
The zero derivative below zero in rectified neurons blocks the back-propagation of the gradient during learning. This effect was investigated using a smooth variant of the rectification non-linearity (the softplus activation). Surprisingly, the results suggest the hard rectification performs better. The authors hypothesize that the hard rectification is not a problem as long as the gradient can be propagated along some paths through the network, and that the complete shut-off of the hard rectification sharpens the credit attribution to neurons during learning.<br />
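The two activations can be compared directly (a sketch in plain Python; softplus(x) = log(1 + e^x), whose derivative is the sigmoid):<br />

```python
import math

def softplus(x):
    """Smooth variant of the rectifier: softplus(x) = log(1 + e^x)."""
    return math.log1p(math.exp(x))

def softplus_grad(x):
    """d/dx softplus(x) = sigmoid(x), which is nonzero everywhere."""
    return 1.0 / (1.0 + math.exp(-x))

def hard_grad(x):
    """Derivative of the hard rectifier max(0, x): exactly zero below threshold."""
    return 1.0 if x > 0 else 0.0

# The hard rectifier blocks the gradient for negative inputs; softplus does not.
assert hard_grad(-2.0) == 0.0
assert softplus_grad(-2.0) > 0.0
# For large positive inputs the two activations nearly coincide.
assert abs(softplus(10.0) - 10.0) < 1e-4
```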
<br />
Furthermore, the unbounded nature of the rectification non-linearity can lead to numerical instabilities if activations grow too large. To circumvent this, an <math>L_1</math> regularizer is used.<br />
<br />
Finally, rectifier networks are subject to ill conditioning of the parametrization. Biases and weights can be scaled in different (and consistent) ways while preserving the same overall network function.<br />
<br />
= Experiments =<br />
<br />
Networks with rectifier neurons were applied to the domains of image recognition and sentiment analysis. The datasets for image recognition included black-and-white (MNIST, NISTP), colour (CIFAR10), and stereo (NORB) images.<br />
<br />
The datasets for sentiment analysis were taken from opentable.com and Amazon. The task of both was to predict the star rating based off the text blurb of the review.<br />
<br />
== Results ==<br />
<br />
'''Results from image classification'''<br />
[[File:rectifier_res_1.png]]<br />
<br />
'''Results from sentiment classification'''<br />
[[File:rectifier_res_2.png]]<br />
<br />
For the image recognition tasks, they find that there is almost no improvement from unsupervised pre-training with rectifier activations, contrary to what is observed with tanh or softplus; indeed, the best performance is achieved when the network is trained without unsupervised pre-training.<br />
<br />
In the NORB and sentiment analysis cases, the network benefited greatly from pre-training. However, the benefit in NORB diminished as the training set size grew.<br />
<br />
The result from the Amazon dataset was 78.95%, while the state of the art was 73.72%.<br />
<br />
The sparsity achieved with the rectified linear neurons helps to diminish the gap between networks with unsupervised pre-training and no pre-training.<br />
<br />
== Discussion / Criticism ==<br />
<br />
* Rectifier neurons really aren't biologically plausible, for a variety of reasons. Namely, the neurons in the cortex do not have tuning curves resembling the rectifier. Additionally, the ideal sparsity of the rectifier networks was from 50 to 80%, while the brain is estimated to have a sparsity of around 95 to 99%.<br />
<br />
* The sparsity property encouraged by ReLU is a double-edged sword: while sparsity encourages information disentangling, efficient variable-size representation, linear separability, and increased robustness, as suggested by the authors of this paper, <ref>Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).</ref> argues that computing on sparse non-uniform data structures is very inefficient; the overhead and cache misses would make it computationally expensive to justify using sparse data structures.<br />
<br />
* ReLU does not suffer from the vanishing gradient problem.<br />
<br />
* ReLU units can be prone to "die", i.e. output the same value regardless of the input. This occurs when a large negative bias is learnt, causing the output of the ReLU to be zero; the unit then gets stuck at zero because the gradient at zero is zero. Techniques such as Leaky ReLU and Maxout mitigate this problem.<br />
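The dying-unit behaviour and the Leaky ReLU mitigation can be illustrated as follows (a sketch; alpha = 0.01 is a conventional leak value, not one from this paper):<br />

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU keeps a small slope alpha below zero, so the gradient never vanishes."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

# A unit whose learnt bias is so negative that its pre-activation is always < 0:
pre_activation = -5.0
# the plain ReLU is "dead" -- zero output AND zero gradient, so it cannot recover --
assert relu(pre_activation) == 0.0 and relu_grad(pre_activation) == 0.0
# while the leaky variant still passes a small gradient through.
assert abs(leaky_relu(pre_activation) - (-0.05)) < 1e-12
assert leaky_relu_grad(pre_activation) == 0.01
```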
<br />
= Bibliography =<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Sparse_Rectifier_Neural_Networks&diff=27086deep Sparse Rectifier Neural Networks2015-12-05T22:01:07Z<p>X435liu: </p>
<hr />
<div>= Introduction =<br />
<br />
Two trends in Deep Learning can be seen in terms of architecture improvements. The first is increasing sparsity (for example, see convolutional neural nets); the second is increasing biological plausibility (biologically plausible sigmoid neurons performing better than tanh neurons). Rectified linear neurons are good for both sparsity and biological plausibility, and thus should increase performance.<br />
<br />
In this paper the authors show that two common gaps between computational neuroscience models and machine learning neural network models can be bridged by the rectifier activation function. One is the gap between deep networks learnt with and without unsupervised pre-training; the other is between the activation function and sparsity in neural networks.<br />
<br />
== Biological Plausibility and Sparsity ==<br />
<br />
In the brain, neurons rarely fire at the same time, as a way to balance quality of representation against energy conservation. This is in stark contrast to sigmoid neurons, which fire at 1/2 of their maximum rate at zero input. A solution to this problem is a rectifier neuron, which does not fire at its zero value. The rectified linear unit is inspired by a common biological model of the neuron, the leaky integrate-and-fire model (LIF), described by Dayan and Abbott<ref><br />
Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems<br />
</ref>. Its function is illustrated in the figure below (middle).<br />
<br />
<gallery><br />
Image:sig_neuron.png|Sigmoid and TANH Neuron<br />
Image:lif_neuron.png|Leaky Integrate Fire Neuron<br />
Image:rect_neuron.png|Rectified Linear Neuron<br />
</gallery><br />
<br />
Given that the rectifier neuron has a larger range of inputs that are output as zero, its representation will clearly be sparser. In the paper, the two most salient advantages of sparsity are:<br />
<br />
- '''Information Disentangling''' As opposed to a dense representation, where every slight input change results in a considerable output change, the non-zero items of a sparse representation remain almost constant under slight input changes.<br />
<br />
- '''Variable Dimensionality''' A sparse representation can effectively choose how many dimensions to use to represent a variable, since it chooses how many non-zero elements to contribute. Thus, the precision is variable, allowing for more efficient representation of complex items.<br />
<br />
Further benefits of a sparse representation, and of rectified linear neurons in particular, are better linear separability and lower computational complexity (most units are off, and for the active units only a linear function has to be computed).<br />
<br />
However, it should also be noted that sparsity reduces the capacity of the model because each unit takes part in the representation of fewer values.<br />
<br />
== Advantages of rectified linear units ==<br />
<br />
The rectifier activation function max(0, x) allows a network to easily obtain sparse representations. For a given input, once the subset of active neurons is fixed, the output is a linear function of the input, which means gradients propagate well along the active paths of neurons and mathematical investigation is easier.<br />
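The sparsity this induces is easy to see on synthetic data (a sketch, assuming zero-centred Gaussian pre-activations; roughly half of them fall below zero and are clipped to exact zeros):<br />

```python
import random

def relu(x):
    """Rectifier activation: max(0, x)."""
    return max(0.0, x)

random.seed(0)
# Hypothetical zero-centred Gaussian pre-activations for 1000 units.
pre = [random.gauss(0.0, 1.0) for _ in range(1000)]
post = [relu(x) for x in pre]

# About half the units are clipped to exact zeros, giving a genuinely sparse code.
sparsity = sum(1 for a in post if a == 0.0) / len(post)
assert 0.4 < sparsity < 0.6
# On the active units the map is purely linear: non-zero outputs equal their inputs.
assert all(a == x for a, x in zip(post, pre) if a > 0.0)
```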
<br />
== Potential problems of rectified linear units ==<br />
<br />
The zero derivative below zero in rectified neurons blocks the back-propagation of the gradient during learning. This effect was investigated using a smooth variant of the rectification non-linearity (the softplus activation). Surprisingly, the results suggest the hard rectification performs better. The authors hypothesize that the hard rectification is not a problem as long as the gradient can be propagated along some paths through the network, and that the complete shut-off of the hard rectification sharpens the credit attribution to neurons during learning.<br />
<br />
Furthermore, the unbounded nature of the rectification non-linearity can lead to numerical instabilities if activations grow too large. To circumvent this, an <math>L_1</math> regularizer is used.<br />
<br />
Finally, rectifier networks are subject to ill conditioning of the parametrization. Biases and weights can be scaled in different (and consistent) ways while preserving the same overall network function.<br />
<br />
= Experiments =<br />
<br />
Networks with rectifier neurons were applied to the domains of image recognition and sentiment analysis. The datasets for image recognition included black-and-white (MNIST, NISTP), colour (CIFAR10), and stereo (NORB) images.<br />
<br />
The datasets for sentiment analysis were taken from opentable.com and Amazon. The task of both was to predict the star rating based off the text blurb of the review.<br />
<br />
== Results ==<br />
<br />
'''Results from image classification'''<br />
[[File:rectifier_res_1.png]]<br />
<br />
'''Results from sentiment classification'''<br />
[[File:rectifier_res_2.png]]<br />
<br />
In the NORB and sentiment analysis cases, the network benefited greatly from pre-training. However, the benefit in NORB diminished as the training set size grew.<br />
<br />
The result from the Amazon dataset was 78.95%, while the state of the art was 73.72%.<br />
<br />
The sparsity achieved with the rectified linear neurons helps to diminish the gap between networks with unsupervised pre-training and no pre-training.<br />
<br />
== Discussion / Criticism ==<br />
<br />
* Rectifier neurons really aren't biologically plausible, for a variety of reasons. Namely, the neurons in the cortex do not have tuning curves resembling the rectifier. Additionally, the ideal sparsity of the rectifier networks was from 50 to 80%, while the brain is estimated to have a sparsity of around 95 to 99%.<br />
<br />
* The sparsity property encouraged by ReLU is a double-edged sword: while sparsity encourages information disentangling, efficient variable-size representation, linear separability, and increased robustness, as suggested by the authors of this paper, <ref>Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).</ref> argues that computing on sparse non-uniform data structures is very inefficient; the overhead and cache misses would make it computationally expensive to justify using sparse data structures.<br />
<br />
* ReLU does not suffer from the vanishing gradient problem.<br />
<br />
* ReLU units can be prone to "die", i.e. output the same value regardless of the input. This occurs when a large negative bias is learnt, causing the output of the ReLU to be zero; the unit then gets stuck at zero because the gradient at zero is zero. Techniques such as Leaky ReLU and Maxout mitigate this problem.<br />
<br />
= Bibliography =<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Sparse_Rectifier_Neural_Networks&diff=27085deep Sparse Rectifier Neural Networks2015-12-05T21:48:46Z<p>X435liu: </p>
<hr />
<div>= Introduction =<br />
<br />
Two trends in Deep Learning can be seen in terms of architecture improvements. The first is increasing sparsity (for example, see convolutional neural nets); the second is increasing biological plausibility (biologically plausible sigmoid neurons performing better than tanh neurons). Rectified linear neurons are good for both sparsity and biological plausibility, and thus should increase performance.<br />
<br />
In this paper the authors show that two common gaps between computational neuroscience models and machine learning neural network models can be bridged by the rectifier activation function. One is the gap between deep networks learnt with and without unsupervised pre-training; the other is between the activation function and sparsity in neural networks.<br />
<br />
== Biological Plausibility and Sparsity ==<br />
<br />
In the brain, neurons rarely fire at the same time, as a way to balance quality of representation against energy conservation. This is in stark contrast to sigmoid neurons, which fire at 1/2 of their maximum rate at zero input. A solution to this problem is a rectifier neuron, which does not fire at its zero value. The rectified linear unit is inspired by a common biological model of the neuron, the leaky integrate-and-fire model (LIF), described by Dayan and Abbott<ref><br />
Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems<br />
</ref>. Its function is illustrated in the figure below (middle).<br />
<br />
<gallery><br />
Image:sig_neuron.png|Sigmoid and TANH Neuron<br />
Image:lif_neuron.png|Leaky Integrate Fire Neuron<br />
Image:rect_neuron.png|Rectified Linear Neuron<br />
</gallery><br />
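<br />
For concreteness, the activation functions compared here can be written as follows (a minimal Python sketch; the LIF response curve has no comparably simple closed form and is omitted):<br />

```python
import math

def sigmoid(x):
    # Saturating activation: outputs 1/2 at x = 0, so the unit "fires"
    # at half its maximum rate even with no input.
    return 1.0 / (1.0 + math.exp(-x))

def rectifier(x):
    # Hard rectification max(0, x): exactly zero for all non-positive inputs.
    return max(0.0, x)

def softplus(x):
    # Smooth variant log(1 + e^x): approaches the rectifier for large |x|
    # but is never exactly zero.
    return math.log1p(math.exp(x))
```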
<br />
Given that the rectifier neuron maps a larger range of inputs to an output of exactly zero, its representation is naturally sparser. The paper highlights two salient advantages of sparsity:<br />
<br />
- '''Information Disentangling''' As opposed to a dense representation, where every slight input change results in a considerable output change, the set of non-zero items of a sparse representation remains almost constant under slight input changes.<br />
<br />
- '''Variable Dimensionality''' A sparse representation can effectively choose how many dimensions to use to represent a variable, since the number of non-zero elements can vary. The precision is thus variable, allowing for more efficient representation of complex items.<br />
<br />
Further benefits of a sparse representation, and of rectified linear neurons in particular, are better linear separability and lower computational cost: most units are off, and for the units that are on only a linear function has to be computed.<br />
<br />
However, it should also be noted that sparsity reduces the capacity of the model because each unit takes part in the representation of fewer values.<br />
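<br />
The sparsity of rectifier representations is easy to check numerically: with zero-mean random inputs and weights, roughly half of the units in a rectifier layer output exactly zero (a sketch with made-up layer sizes, not the paper's experimental setup):<br />

```python
import random

random.seed(0)
n_in, n_hidden = 100, 200
w = [[random.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
x = [random.gauss(0, 1) for _ in range(n_in)]

# Pre-activations, then hard rectification.
z = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
h = [max(0.0, zi) for zi in z]

# Fraction of units that are exactly zero; with symmetric inputs and
# weights this is close to 0.5 on average.
sparsity = sum(1 for hi in h if hi == 0.0) / n_hidden
```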
<br />
== Potential problems of rectified linear units ==<br />
<br />
The zero derivative below zero in rectified neurons blocks the back-propagation of the gradient during learning. This effect was investigated using a smooth variant of the rectification non-linearity (the softplus activation). Surprisingly, the results suggest the hard rectification performs better. The authors hypothesize that hard rectification is not a problem as long as the gradient can be propagated along some paths through the network, and that the complete shut-off of hard rectification sharpens the credit assignment to neurons during learning.<br />
<br />
Furthermore, the unbounded nature of the rectification non-linearity can lead to numerical instabilities if activations grow too large. To circumvent this, an <math>L_1</math> regularizer on the activations is used.<br />
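<br />
A sketch of how such an <math>L_1</math> activation penalty enters the training objective (hypothetical function names and coefficient; not code from the paper):<br />

```python
def l1_penalty(activations, lam=0.001):
    # L1 regularizer on the hidden activations: encourages exact zeros
    # and keeps the unbounded rectifier outputs from growing too large.
    return lam * sum(abs(a) for a in activations)

def regularized_loss(data_loss, activations, lam=0.001):
    # Total training objective: data term plus the sparsity penalty.
    return data_loss + l1_penalty(activations, lam)
```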
<br />
= Experiments =<br />
<br />
Networks with rectifier neurons were applied to image recognition and sentiment analysis. The image recognition datasets included black-and-white (MNIST, NISTP), colour (CIFAR10) and stereo (NORB) images.<br />
<br />
The sentiment analysis datasets were taken from opentable.com and Amazon. In both, the task was to predict the star rating from the text of the review.<br />
<br />
== Results ==<br />
<br />
'''Results from image classification'''<br />
[[File:rectifier_res_1.png]]<br />
<br />
'''Results from sentiment classification'''<br />
[[File:rectifier_res_2.png]]<br />
<br />
In the NORB and sentiment analysis cases, the network benefited greatly from pre-training. However, the benefit in NORB diminished as the training set size grew.<br />
<br />
The result from the Amazon dataset was 78.95%, while the state of the art was 73.72%.<br />
<br />
The sparsity achieved with the rectified linear neurons helps to diminish the gap between networks with unsupervised pre-training and no pre-training.<br />
<br />
== Discussion / Criticism ==<br />
<br />
* Rectifier neurons are not really biologically plausible, for a variety of reasons. Notably, neurons in the cortex do not have tuning curves resembling the rectifier. Additionally, the ideal sparsity of the rectifier networks was 50 to 80%, while the brain's sparsity is estimated at around 95 to 99%.<br />
<br />
* The sparsity encouraged by ReLU is a double-edged sword. While sparsity encourages information disentangling, efficient variable-size representation and linear separability, the authors of <ref>Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).</ref> argue that computation on sparse, non-uniform data structures is very inefficient: the overhead and cache misses make sparse data structures hard to justify computationally.<br />
<br />
* ReLU does not suffer from the vanishing gradient problem, since its derivative is exactly 1 for all positive inputs.<br />
<br />
* ReLU units can be prone to "die": the unit outputs the same value (zero) regardless of its input. This occurs when a large negative bias is learned, driving the ReLU's output to zero; the unit then stays stuck at zero because the gradient at zero is zero. Techniques such as Leaky ReLU and Maxout mitigate this problem.<br />
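<br />
Leaky ReLU, one of the mitigations mentioned above, keeps a small slope for negative inputs so that a "dead" unit can recover (a minimal sketch; the slope 0.01 is a common but arbitrary choice):<br />

```python
def relu(x):
    # Plain ReLU: zero output and zero gradient for all negative inputs.
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Output (and hence gradient) is non-zero for negative inputs,
    # so the unit cannot get permanently stuck at zero.
    return x if x > 0 else slope * x
```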
<br />
= Bibliography =<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26439dropout2015-11-18T03:04:50Z<p>X435liu: /* Result */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks with a large number of parameters. The key idea is to randomly drop units from the neural network during training; this amounts to sampling from an exponential number of different “thinned” networks. At test time, the effect of averaging the predictions of all these thinned networks is approximated with a single network. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p independently of the other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z}^{(l)} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables, each of which takes the value 1 with probability <math>p </math>. <math>\tilde {\bold y}^{(l)} </math> is the input after some hidden units have been dropped. The rest of the model is the same as a regular feed-forward neural network.<br />
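<br />
The feed-forward equations above can be sketched for a single layer as follows (a plain-Python sketch with assumed toy types; `f` is the layer non-linearity, and at test time the retention probability is folded into the incoming activations, which is equivalent to scaling the outgoing weights):<br />

```python
import random

def dropout_layer(y_prev, W, b, f, p=0.5, train=True):
    if train:
        # Sample a Bernoulli(p) mask r and thin the incoming activations.
        r = [1 if random.random() < p else 0 for _ in y_prev]
        y_thin = [ri * yi for ri, yi in zip(r, y_prev)]
    else:
        # Test time: no mask; scale the incoming activations by p instead.
        y_thin = [p * yi for yi in y_prev]
    # z = W * y_thin + b, then the non-linearity f.
    z = [sum(wij * yj for wij, yj in zip(w_row, y_thin)) + bi
         for w_row, bi in zip(W, b)]
    return [f(zi) for zi in z]
```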
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that back-propagation is performed on each thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch; any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
Dropout can also be applied to fine-tune nets that have been pre-trained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pre-training followed by fine-tuning with back-propagation has been shown to give significant performance boosts over fine-tuning from random initializations in certain cases. A smaller learning rate should be used to retain the information in the pre-trained weights.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout together with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over dropout alone. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper-bounded by a fixed constant <math>c </math>: if <math>\bold w </math> is the vector of weights incident on any hidden unit, the constraint is <math>||\bold w ||_2 \leq c </math>. One justification for this constraint is that it makes it possible to use a large learning rate without the weights blowing up. <br />
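<br />
Max-norm regularization amounts to projecting each incoming weight vector back onto a ball of radius <math>c </math> after every update (a sketch; the value of <math>c </math> is a tunable constant):<br />

```python
import math

def max_norm_project(w, c=3.0):
    # If ||w||_2 exceeds c, rescale w so its norm is exactly c;
    # otherwise leave it unchanged.
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm > c:
        return [wi * (c / norm) for wi in w]
    return w
```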
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units; then there are <math>2^n </math> possible thinned networks. It is not feasible to explicitly run exponentially many thinned models and average their predictions. Instead, at test time, a single neural net without dropout is used. The weights of this network are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout multiplies the activations by Bernoulli-distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>, which works just as well as, or perhaps better than, Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> with <math>r \sim \mathcal{N}(0,1) </math>, i.e. to <math>h_ir' </math> with <math>r' \sim \mathcal{N}(1, 1) </math>. This can be further generalized to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2</math> is a hyperparameter to tune.<br />
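<br />
A sketch of multiplicative Gaussian noise (assumed function name; since the noise has mean 1, no rescaling is needed at test time):<br />

```python
import random

def gaussian_dropout(h, sigma=1.0, train=True):
    if not train:
        # The noise has mean 1, so the activations pass through unchanged.
        return list(h)
    # Each activation h_i becomes h_i * r' with r' ~ N(1, sigma^2).
    return [hi * random.gauss(1.0, sigma) for hi in h]
```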
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
In a standard neural network, units may change in a way that fixes up the mistakes of other units, which can lead to complex co-adaptations that do not generalize to unseen data, i.e. overfitting. Dropout breaks these co-adaptations by making the presence of any other unit unreliable. Figure 7a shows that the hidden units have co-adapted in order to produce good reconstructions; no hidden unit on its own detects a meaningful feature. In Figure 7b, the hidden units appear to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, only a few units should be highly activated for any data case, and the average activation of any unit across data cases should be low. Comparing the histograms of activations, fewer hidden units have high activations in Figure 8b than in Figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper tested the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
# The number of hidden units is held constant (fixed <math>n </math>).<br />
# The expected number of hidden units retained after dropout is held constant (fixed <math>pn </math>).<br />
The optimal <math>p </math> in case 1 lies between 0.4 and 0.8, while in case 2 it is 0.6. The usual default value in practice is 0.5, which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of data set size when dropout is used with feed-forward networks. From Figure 10, dropout gives no improvement on small data sets (100 and 500 examples). As the data set grows, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizers. Dropout combined with max-norm regularization outperforms all the other methods tested. The results are below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors evaluated dropout on the MNIST data set and compared it with other methods. MNIST consists of 28 × 28 pixel handwritten digit images, and the task is to classify each image into one of 10 digit classes. From the result table, a Deep Boltzmann Machine with dropout fine-tuning performs best, with an error rate of only 0.79%. <br />
<br />
[[File:Result.png]]<br />
<br />
To test the robustness of dropout, the authors ran classification experiments with networks of many different architectures, keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these architectures as training progresses. Dropout gives a substantial improvement across all architectures.<br />
<br />
[[File:dropout.PNG]]<br />
<br />
The authors also applied the dropout scheme to many neural networks and tested on other datasets, such as Street View House Numbers (SVHN), CIFAR, ImageNet and the TIMIT speech dataset. Adding dropout consistently reduced the error rate and further improved the performance of the networks.<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique for preventing overfitting in deep neural networks with large numbers of parameters. It can also be extended to Restricted Boltzmann Machines and other graphical models (e.g., convolutional networks). One drawback of dropout is that it increases training time, creating a trade-off between overfitting and training cost.<br />
<br />
=Reference=<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26438dropout2015-11-18T02:55:49Z<p>X435liu: /* Result */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks, which contain large numbers of parameters. The key idea is to randomly drop units from the network during training. In effect, dropout samples from an exponential number of different “thinned” networks during training. At test time, the effect of averaging the predictions of all these thinned networks is approximated with a single network. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p independently of the other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z}^{(l)} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables, each of which has probability <math>p </math> of being 1. <math>\tilde {\bold y}^{(l)} </math> is the output of layer <math>l </math> after some units have been dropped. The rest of the model remains the same as in a regular feed-forward neural network.<br />
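The feed-forward equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code; the function name, the use of ReLU for <math>f </math>, and the shapes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(y_prev, W, b, p, train=True):
    """One feed-forward step with dropout, following the equations above.
    y_prev: outputs y^(l) of layer l; W, b: weights/biases of layer l+1;
    p: probability of *retaining* each unit. f is taken to be ReLU here."""
    if train:
        r = rng.binomial(1, p, size=y_prev.shape)  # r_j ~ Bernoulli(p)
        y_tilde = r * y_prev                       # element-wise mask
    else:
        # at test time no units are dropped; the trained weights are
        # scaled by p instead (see the "Test Time" section)
        y_tilde = y_prev
    z = W @ y_tilde + b                            # z^(l+1) = W y~ + b
    return np.maximum(z, 0.0)                      # y^(l+1) = f(z^(l+1))
```

With p = 1 the mask is all ones and the layer reduces to an ordinary feed-forward layer.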
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that we backpropagate only through each thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case that does not use a parameter contributes a gradient of zero for that parameter.<br />
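A concrete (purely illustrative) sketch of one such update, for a single dropout layer feeding a linear output unit with squared loss: the sampled mask zeroes out dropped inputs, so their weights receive exactly zero gradient. All names, shapes and values here are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# One SGD step on a single training case.
y_prev = np.array([1.0, -0.5, 2.0])   # outputs of the previous layer
W = np.zeros((1, 3))                  # weights into the linear output unit
b = np.zeros(1)
target = np.array([1.0])
lr, p = 0.1, 0.5

r = rng.binomial(1, p, size=y_prev.shape)  # sample the thinned network
y_thin = r * y_prev
z = W @ y_thin + b
grad_z = z - target                  # d(0.5*(z - t)^2)/dz
W -= lr * np.outer(grad_z, y_thin)   # dropped inputs contribute zero gradient
b -= lr * grad_z
```

Whatever mask is drawn, the columns of W corresponding to dropped units are left untouched by the update, which is exactly the "backpropagate on each thinned network" rule.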
<br />
Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. The learning rate should be kept small in order to retain the information in the pretrained weights.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout together with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over dropout alone. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be bounded above by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, we impose the constraint <math>||\bold w ||_2 \leq c </math>. One justification for this constraint is that it makes it possible to use large learning rates without the weights blowing up. <br />
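The constraint is typically enforced by projecting each incoming weight vector back onto the ball of radius <math>c </math> after every gradient update. A minimal sketch (the function name and the convention that rows hold incoming weights are assumptions):

```python
import numpy as np

def max_norm(W, c=3.0):
    """Project each row of W (the incoming weights of one hidden unit)
    onto the L2 ball of radius c: rows with ||w||_2 <= c are unchanged,
    larger rows are rescaled to have norm exactly c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale
```

Here <math>c </math> is a tunable hyperparameter; since the projection only ever shrinks weights, it can be applied after every update at negligible cost.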
<br />
'''Test Time'''<br />
<br />
If a neural net has n units, there are <math>2^n </math> possible thinned networks, so it is not feasible to explicitly run exponentially many thinned models and average their predictions. Instead, at test time, a single neural net without dropout is used, whose weights are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
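For a linear unit this scaling rule can be checked numerically: the average of many randomly thinned predictions converges to the single prediction made with weights multiplied by <math>p </math>. The weights and inputs below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

p, n_samples = 0.5, 200_000
w = np.array([0.3, -1.2, 0.8])   # outgoing weights of three units
y = np.array([1.0, 2.0, 0.5])    # their activations

# Monte Carlo average over thinned networks (one Bernoulli mask per sample)
masks = rng.binomial(1, p, size=(n_samples, 3))
masked = (masks * y) @ w
print(masked.mean())             # approaches the scaled-weight prediction
print((p * w) @ y)               # -0.85, the deterministic test-time value
```

For nonlinear networks the scaled single network is only an approximation to the geometric/arithmetic mean of the thinned ensemble, but it works very well in practice.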
<br />
[[File:test.png]]<br />
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout multiplies activations by Bernoulli-distributed random variables that take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by instead multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>, which works just as well as, or perhaps better than, Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which equals <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. This generalizes to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2</math> is a hyperparameter to tune.<br />
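The Gaussian variant is a one-liner; a sketch follows (the function name is illustrative). Because the noise has mean 1, the expected activation is unchanged, so no weight scaling is needed at test time, where the noise is simply switched off.

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_dropout(h, sigma):
    """Multiplicative Gaussian noise: h_i -> h_i * r', r' ~ N(1, sigma^2).
    E[h_i * r'] = h_i, so test-time inference just uses h unchanged."""
    return h * rng.normal(1.0, sigma, size=h.shape)
```

Here <math>\sigma </math> plays the role that the dropout rate plays for Bernoulli noise: larger <math>\sigma </math> means stronger regularization.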
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
In a standard neural network, units may change in a way that fixes up the mistakes of the other units, which can lead to complex co-adaptations and to overfitting, because these co-adaptations do not generalize to unseen data. Dropout breaks the co-adaptations between hidden units by making the presence of other units unreliable. Figure 7a shows that each hidden unit has co-adapted in order to produce good reconstructions; no hidden unit on its own detects a meaningful feature. In Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, only a few units should be highly activated for any data case, and the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b than in Figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to determine the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 lies roughly in the range (0.4, 0.8), while in case 2 it is about 0.6. The usual default of 0.5 is therefore close to optimal in practice. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of data set size when dropout is used with feed-forward networks. Figure 10 shows that dropout gives no improvement on small data sets (100, 500 examples). As the data set grows, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all the other methods considered. The results are shown below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors applied dropout to the MNIST data and compared it with other methods. The MNIST data set consists of 28 × 28 pixel handwritten digit images, and the task is to classify each image into one of 10 digit classes. In the result table, a Deep Boltzmann Machine with dropout finetuning performs best, with an error rate of only 0.79%. <br />
<br />
[[File:Result.png]]<br />
<br />
In order to test the robustness of dropout, the authors ran classification experiments with networks of many different architectures, keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these different architectures as training progresses. Dropout gives a significant improvement across all architectures.<br />
<br />
[[File:dropout.PNG]]<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique for preventing overfitting in deep neural networks with large numbers of parameters. It can also be extended to Restricted Boltzmann Machines and other graphical models (e.g., convolutional networks). One drawback of dropout is that it increases training time, creating a trade-off between overfitting and training cost.<br />
<br />
=Reference=<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Dropout.PNG&diff=26437File:Dropout.PNG2015-11-18T02:55:12Z<p>X435liu: </p>
<hr />
<div></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26436dropout2015-11-18T02:48:34Z<p>X435liu: /* Effects of Dropout */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is one of the techniques for preventing overfitting in deep neural network which contains a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” network. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retrained with probability p independent of other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layer. Let <math>\bold{z^{(l)}} </math> denote the vector inputs into layer <math> l </math>, <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
Dropout neural network can be trained using stochastic gradient descent in a manner similar to standard neural network. The only difference here is that we only back propagate on each thinned network. The gradient for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. The learning rate should be a smaller one to retain the information in the pretrained weights.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over just using dropout. Max-norm constrain the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we put constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constrain is that it makes the model possible to use huge learning rate without the possibility of weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units, there will be <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weight of this network are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. Figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout takes Bernoulli distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>. It works just as well, or perhaps better than using Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which equals to <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. We can generalize this to <math>r' \sim \mathcal{N}(1, \sigma^2) </math> which <math>\sigma^2</math> is a hyperparameter to tune.<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
In a standard neural network, units may change in a way that they fix up the mistakes of the other units, which may lead to complex co-adaptations and overfitting because these co-adaptations do not generalize to unseen data. Dropout breaks the co-adaptations between hidden units by making the presence of other units unreliable. Firgure 7a shows that each hidden unit has co-adapted in order to produce good reconstructions. Each hidden unit its own does not detect meaningful feature. In figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps preventing overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations we can see that fewer hidden units have high activations in Figure 8b compared to figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper tested to determine the tunable hyperparameter <math>p </math>. The comparison is down in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 is between (0.4, 0.8 ); while the one in case 2 is 0.6. The usual default value in practice is 0.5 which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing data set size when dropout is used with feed-forward networks. From Figure 10, apparently, dropout does not give any improvement in small data sets(100, 500). As the size of the data set is increasing, then gain from doing dropout increases up to a point and then decline. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The author performed dropout on MNIST data and did comparison among different methods. The MNIST data set consists of 28 X 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. From the result table, Deep Botlzman Machine + dropout finetuning outperforms with only 0.79% Error rate. <br />
<br />
[[File:Result.png]]<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique to prevent overfitting in deep neural network which has large number of parameters. It can also be extended to Restricted Boltzmann Machine and other graphical models, eg(Convolutional network). One drawback of dropout is that it increases training time. This creates a trade-off between overfitting and training time.<br />
<br />
=Reference=<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26435dropout2015-11-18T02:47:56Z<p>X435liu: /* Effects of Dropout */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is one of the techniques for preventing overfitting in deep neural network which contains a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” network. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retrained with probability p independent of other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layer. Let <math>\bold{z^{(l)}} </math> denote the vector inputs into layer <math> l </math>, <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
Dropout neural network can be trained using stochastic gradient descent in a manner similar to standard neural network. The only difference here is that we only back propagate on each thinned network. The gradient for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. The learning rate should be a smaller one to retain the information in the pretrained weights.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over just using dropout. Max-norm constrain the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we put constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constrain is that it makes the model possible to use huge learning rate without the possibility of weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units, there will be <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weight of this network are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. Figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout takes Bernoulli distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>. It works just as well, or perhaps better than using Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which equals to <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. We can generalize this to <math>r' \sim \mathcal{N}(1, \sigma^2) </math> which <math>\sigma^2</math> is a hyperparameter to tune.<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
In a standard neural network, units may change in a way that they fix up the mistakes of the other units, which may lead to complex co-adaptations and overfitting because these co-adaptations do not generalize to unseen data.<br />
<br />
Dropout breaks the co-adaptations between hidden units by making the presence of other units unreliable. Firgure 7a shows that each hidden unit has co-adapted in order to produce good reconstructions. Each hidden unit its own does not detect meaningful feature. In figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps preventing overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations we can see that fewer hidden units have high activations in Figure 8b compared to figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper tested to determine the tunable hyperparameter <math>p </math>. The comparison is down in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 is between (0.4, 0.8 ); while the one in case 2 is 0.6. The usual default value in practice is 0.5 which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing data set size when dropout is used with feed-forward networks. From Figure 10, apparently, dropout does not give any improvement in small data sets(100, 500). As the size of the data set is increasing, then gain from doing dropout increases up to a point and then decline. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The author performed dropout on MNIST data and did comparison among different methods. The MNIST data set consists of 28 X 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. From the result table, Deep Botlzman Machine + dropout finetuning outperforms with only 0.79% Error rate. <br />
<br />
[[File:Result.png]]<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique to prevent overfitting in deep neural network which has large number of parameters. It can also be extended to Restricted Boltzmann Machine and other graphical models, eg(Convolutional network). One drawback of dropout is that it increases training time. This creates a trade-off between overfitting and training time.<br />
<br />
=Reference=<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26415dropout2015-11-17T22:43:52Z<p>X435liu: /* Model */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is one of the techniques for preventing overfitting in deep neural network which contains a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” network. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retrained with probability p independent of other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layer. Let <math>\bold{z^{(l)}} </math> denote the vector inputs into layer <math> l </math>, <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
Dropout neural network can be trained using stochastic gradient descent in a manner similar to standard neural network. The only difference here is that we only back propagate on each thinned network. The gradient for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. The learning rate should be a smaller one to retain the information in the pretrained weights.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over just using dropout. Max-norm constrain the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we put constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constrain is that it makes the model possible to use huge learning rate without the possibility of weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units, there will be <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weight of this network are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. Figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout takes Bernoulli distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>. It works just as well, or perhaps better than using Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which equals to <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. We can generalize this to <math>r' \sim \mathcal{N}(1, \sigma^2) </math> which <math>\sigma^2</math> is a hyperparameter to tune.<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
Dropout breaks the co-adaptations between hidden units. Firgure 7a shows that each hidden unit has co-adapted in order to produce good reconstructions. Each hidden unit its own does not detect meaningful feature. In figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps preventing overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations we can see that fewer hidden units have high activations in Figure 8b compared to figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to determine the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two settings:<br />
1. The number of hidden units is held constant (fixed <math>n </math>).<br />
2. The expected number of hidden units retained after dropout is held constant (fixed <math>pn </math>).<br />
The optimal <math>p </math> in case 1 lies between 0.4 and 0.8, while in case 2 it is 0.6. The usual default value in practice is 0.5, which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of data set size when dropout is used with feed-forward networks. From Figure 10, dropout gives no improvement on small data sets (100, 500 examples). As the size of the data set increases, the gain from dropout grows up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent under different regularization schemes. Dropout + max-norm outperforms all the other methods tested. The results are shown below:<br />
<br />
[[File:Comparison.png]]<br />
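The max-norm constraint in this comparison projects each unit's incoming weight vector back onto a ball of radius <math>c</math> after every update. A minimal sketch (the radius and the column-per-unit layout are illustrative assumptions):<br />

```python
import numpy as np

def max_norm(W, c=3.0):
    # Rescale any column (one unit's incoming weights) whose L2 norm exceeds c.
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W * np.minimum(1.0, c / np.maximum(norms, 1e-12))

W = np.array([[3.0, 0.1],
              [4.0, 0.2]])                 # column norms: 5.0 and ~0.22
W_clipped = max_norm(W)
print(np.linalg.norm(W_clipped, axis=0))   # first column clipped to 3.0
```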
<br />
= Result =<br />
<br />
The authors performed dropout on the MNIST data set and compared it with other methods. The MNIST data set consists of 28 × 28 pixel handwritten digit images; the task is to classify each image into one of 10 digit classes. From the result table, a Deep Boltzmann Machine with dropout finetuning performs best, with an error rate of only 0.79%. <br />
<br />
[[File:Result.png]]<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique for preventing overfitting in deep neural networks with large numbers of parameters. It can also be extended to Restricted Boltzmann Machines and other models, e.g. convolutional networks. One drawback of dropout is that it increases training time, creating a trade-off between overfitting and training time.<br />
<br />
=References=<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=joint_training_of_a_convolutional_network_and_a_graphical_model_for_human_pose_estimation&diff=26237joint training of a convolutional network and a graphical model for human pose estimation2015-11-14T02:48:11Z<p>X435liu: </p>
<hr />
<div>== Introduction ==<br />
<br />
Human body pose estimation, or specifically the localization of human joints in monocular RGB images, remains a very challenging task in computer vision. Recent approaches to this problem fall into two broad categories: traditional deformable part models and deep-learning based discriminative models. Traditional models rely on the aggregation of hand-crafted low-level features, then use a standard classifier or a higher-level generative model to detect the pose; this requires features that are sufficiently discriminative yet invariant to deformations. Deep learning approaches learn an empirical set of low- and high-level features which are more tolerant to variations. However, it’s difficult to incorporate prior knowledge about the structure of the human body into such models.<br />
<br />
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network Part-Detector and a part-based Spatial-Model. This combination and joint training significantly outperforms existing state-of-the-art models on the task of human body pose recognition.<br />
<br />
== Model ==<br />
=== Convolutional Network Part-Detector ===<br />
<br />
They combine an efficient sliding window-based architecture with multi-resolution and overlapping receptive fields, which is shown in the figure below.<br />
<br />
[[File:architecture1.PNG | center]]<br />
<br />
First, a Laplacian Pyramid<ref><br />
[https://en.wikipedia.org/wiki/Pyramid_(image_processing)#Gaussian_pyramid "Pyramid (image processing)"]<br />
</ref> of three resolution banks is used to provide each bank with non-overlapping spectral content. Then the Local Contrast Normalization (LCN<ref><br />
Collobert R, Kavukcuoglu K, Farabet C.[http://infoscience.epfl.ch/record/192376/files/Collobert_NIPSWORKSHOP_2011.pdf Torch7: A matlab-like environment for machine learning] BigLearn, NIPS Workshop. 2011 (EPFL-CONF-192376).<br />
</ref>) is applied to those input images. For each resolution bank, sliding-window ConvNet architecture with overlapping receptive fields is used to get a heat-map as output, which produces a per-pixel likelihood for key joint locations on the human skeleton.<br />
<br />
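The Laplacian Pyramid step can be sketched directly: each band stores the detail lost by one blur-and-downsample step, so the banks hold roughly non-overlapping spectral content and the image is exactly recoverable. A minimal NumPy/SciPy sketch (blur width, image size and level count are illustrative assumptions):<br />

```python
import numpy as np
from scipy import ndimage

def laplacian_pyramid(img, levels=3):
    """Split an image into bands of non-overlapping spectral content.

    Each band is the detail removed by one blur-and-downsample step;
    the last entry is the low-resolution residual. Reconstruction is
    exact: upsample the residual and add the bands back, coarse to fine.
    """
    bands = []
    current = img
    for _ in range(levels - 1):
        blurred = ndimage.gaussian_filter(current, sigma=1.0)
        down = blurred[::2, ::2]                        # halve the resolution
        up = np.repeat(np.repeat(down, 2, axis=0), 2, axis=1)
        bands.append(current - up)                      # high-frequency detail
        current = down
    bands.append(current)                               # low-frequency residual
    return bands

img = np.random.default_rng(2).normal(size=(16, 16))
bands = laplacian_pyramid(img)
print([b.shape for b in bands])    # one bank per resolution
```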
The convolution results (feature maps) of the low resolution bank are upscaled and interleaved with those of the high resolution bank. Then these dense feature maps are processed through convolution stages at each pixel, which is equivalent to a fully-connected network model but more efficient.<br />
<br />
Supervised training of the network is performed using batched Stochastic Gradient Descent (SGD) with Nesterov Momentum. They use a Mean Squared Error (MSE) criterion to minimize the distance between the predicted output and a target heat-map. At training time they also perform random perturbations of the input images (randomly flipping and scaling the images) to increase generalization performance.<br />
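A minimal sketch of one such update (SGD with Nesterov momentum on an MSE objective; the learning rate, momentum value and toy data are illustrative assumptions, not the paper's settings):<br />

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.01, mu=0.9):
    """One SGD step with Nesterov momentum: the gradient is evaluated
    at the look-ahead point w + mu*v rather than at w itself."""
    v = mu * v - lr * grad_fn(w + mu * v)
    return w + v, v

# Toy MSE objective: fit w so that w @ x matches a scalar target.
x = np.array([1.0, 2.0])
target = 3.0
grad = lambda w: 2.0 * (w @ x - target) * x   # gradient of (w @ x - target)**2

w, v = np.zeros(2), np.zeros(2)
for _ in range(200):
    w, v = nesterov_step(w, v, grad)
print(w @ x)   # converges toward the target
```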
<br />
=== Higher-Level Spatial-Model ===<br />
<br />
They use a higher-level Spatial-Model to get rid of false positive outliers and anatomically incorrect poses predicted by the Part-Detector, constraining joint inter-connectivity and enforcing global pose consistency.<br />
<br />
They formulate the Spatial-Model as an MRF-like model over the distribution of spatial locations for each body part. After the unary potentials for each body part location are provided by the Part-Detector, the pair-wise potentials in the graph are computed using convolutional priors, which model the conditional distribution of the location of one body part to another. For instance, the final marginal likelihood for a body part A can be calculated as:<br />
<br />
<math>\bar{p}_{A}=\frac{1}{Z}\prod_{v\in V}^{ }\left ( p_{A|v}*p_{v}+b_{v\rightarrow A} \right )</math><br />
<br />
where <math>v</math> ranges over the joint locations, <math>p_{A|v}</math> is the conditional prior giving the likelihood of body part A occurring at pixel location (i, j) when joint <math>v</math> is located at the center pixel, <math>b_{v\rightarrow A}</math> is a bias term describing the background probability for the message from joint <math>v</math> to A, and Z is the partition function. A learned pair-wise distribution becomes essentially uniform when the corresponding pairwise edge should be removed from the graph structure.<br />
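A toy NumPy sketch of this message-passing step for two joints on a small grid (all shapes, the offset in the prior, and the bias value are illustrative assumptions; <code>scipy.signal.convolve2d</code> plays the role of the convolutional prior):<br />

```python
import numpy as np
from scipy.signal import convolve2d

H, W = 9, 9
# Unary heat-maps from the Part-Detector (delta peaks for clarity).
p_face = np.zeros((H, W)); p_face[2, 4] = 1.0
p_shoulder = np.zeros((H, W)); p_shoulder[5, 4] = 1.0

# Conditional prior p_{face|shoulder}: the face tends to sit three
# pixels above the shoulder (offset encoded relative to kernel center).
prior = np.zeros((H, W)); prior[H // 2 - 3, W // 2] = 1.0
b = 1e-3                                     # background bias b_{v -> A}

# p_bar_A = (1/Z) * prod_v (p_{A|v} * p_v + b), with v ranging over the
# shoulder and the part's own message (identity prior on A itself).
msg_from_shoulder = convolve2d(p_shoulder, prior, mode="same") + b
msg_from_self = p_face + b
marginal = msg_from_shoulder * msg_from_self
marginal /= marginal.sum()                   # the 1/Z normalization

print(np.unravel_index(marginal.argmax(), marginal.shape))   # → (2, 4)
```

The two messages reinforce each other only where the detector's evidence and the spatial prior agree, which is how outlier detections get suppressed.<br />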
<br />
For their practical implementation they treat the distributions above as energies to avoid the evaluation of Z in the equation before. Their final model is<br />
<br />
<math>\bar{e}_{A}=\mathrm{exp}\left ( \sum_{v\in V}^{ }\left [ \mathrm{log}\left ( \mathrm{SoftPlus}\left ( e_{A|v} \right )*\mathrm{ReLU}\left ( e_{v} \right )+\mathrm{SoftPlus}\left ( b_{v\rightarrow A} \right ) \right ) \right ] \right )</math><br />
<br />
<math>\mathrm{where:SoftPlus}\left ( x \right )=\frac{1}{\beta }\mathrm{log}\left ( 1+\mathrm{exp}\left ( \beta x \right ) \right ), 0.5\leq \beta \leq 2</math><br />
<br />
<math>\mathrm{ReLU}\left ( x \right )=\mathrm{max}\left ( x,\epsilon \right ), 0< \epsilon \leq 0.01</math><br />
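The two clamping nonlinearities above can be written directly; both keep the terms inside the log strictly positive (default <math>\beta</math> and <math>\epsilon</math> are sample values from the stated ranges):<br />

```python
import numpy as np

def softplus(x, beta=1.0):
    # (1/beta) * log(1 + exp(beta * x)); smooth and strictly positive
    return np.log1p(np.exp(beta * x)) / beta

def relu_eps(x, eps=0.01):
    # max(x, eps): clamps every energy at a small positive epsilon
    return np.maximum(x, eps)

print(softplus(0.0), relu_eps(-5.0))   # log(2) ≈ 0.693 and 0.01
```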
<br />
With this modified formulation, the equation can be trained by using back-propagation and SGD. The network-based implementation of the equation is shown below.<br />
<br />
[[File:architecture2.PNG | center]]<br />
<br />
The convolution kernels used in this step are quite large, so they apply GPU-based FFT convolutions, as introduced by Mathieu et al.<ref><br />
Mathieu M, Henaff M, LeCun Y.[http://arxiv.org/pdf/1312.5851.pdf Fast training of convolutional networks through ffts] arXiv preprint arXiv:1312.5851, 2013.<br />
</ref>. The convolution weights are initialized using the empirical histogram of joint displacements created from the training examples. Moreover, during training they randomly flip and scale the heat-map inputs to improve generalization performance.<br />
<br />
=== Unified Model ===<br />
<br />
They first train the Part-Detector separately and store the heat-map outputs, then use these heat-maps to train a Spatial-Model. Finally, they combine the trained Part-Detector and Spatial-Models and back-propagate through the entire network, which further improves performance.<br />
<br />
== Results ==<br />
<br />
They evaluated their architecture on the FLIC and extended-LSP datasets. The FLIC dataset is comprised of 5003 images from Hollywood movies with actors in predominantly front-facing standing up poses, while the extended-LSP dataset contains a wider variety of poses of athletes playing sport. They also proposed a new dataset called FLIC-plus<ref><br />
[http://cims.nyu.edu/~tompson/flic_plus.htm "FLIC-plus Dataset"]<br />
</ref> which is fairer than the FLIC-full dataset.<br />
<br />
Their model’s performance on the FLIC test-set for the elbow and wrist joints is shown below. It’s trained by using both the FLIC and FLIC-plus training sets.<br />
<br />
[[File:result1.PNG | center]]<br />
<br />
Performance on the LSP dataset is shown here.<br />
<br />
[[File:result2.PNG | center]]<br />
<br />
Since the LSP dataset covers a larger range of possible poses, their Spatial-Model is less effective, and the accuracy on this dataset is lower than on FLIC. They believe that increasing the size of the training set will improve performance on these difficult cases.<br />
<br />
== Bibliography ==<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=joint_training_of_a_convolutional_network_and_a_graphical_model_for_human_pose_estimation&diff=26236joint training of a convolutional network and a graphical model for human pose estimation2015-11-14T02:45:31Z<p>X435liu: /* Higher-Level Spatial-Model */</p>
<hr />
<div>== Introduction ==<br />
<br />
Human body pose estimation, or specifically the localization of human joints in monocular RGB images, remains a very challenging task in computer vision. Recent approaches to this problem fall into two broad categories: traditional deformable part models and deep-learning based discriminative models. Traditional models rely on the aggregation of hand-crafted low-level features and then use a standard classifier or a higher level generative model to detect the pose, which require the features to be sensitive enough and invariant to deformations. Deep learning approaches learn an empirical set of low and high-level features which are more tolerant to variations. However, it’s difficult to incorporate prior knowledge about the structure of the human body.<br />
<br />
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network Part-Detector and a part-based Spatial-Model. This combination and joint training significantly outperforms existing state-of-the-art models on the task of human body pose recognition.<br />
<br />
== Model ==<br />
=== Convolutional Network Part-Detector ===<br />
<br />
They combine an efficient sliding window-based architecture with multi-resolution and overlapping receptive fields, which is shown in the figure below.<br />
<br />
[[File:architecture1.PNG | center]]<br />
<br />
First, a Laplacian Pyramid<ref><br />
[https://en.wikipedia.org/wiki/Pyramid_(image_processing)#Gaussian_pyramid "Pyramid (image processing)"]<br />
</ref> of three resolution banks is used to provide each bank with non-overlapping spectral content. Then the Local Contrast Normalization (LCN<ref><br />
Collobert R, Kavukcuoglu K, Farabet C.[http://infoscience.epfl.ch/record/192376/files/Collobert_NIPSWORKSHOP_2011.pdf Torch7: A matlab-like environment for machine learning] BigLearn, NIPS Workshop. 2011 (EPFL-CONF-192376).<br />
</ref>) is applied to those input images. For each resolution bank, sliding-window ConvNet architecture with overlapping receptive fields is used to get a heat-map as output, which produces a per-pixel likelihood for key joint locations on the human skeleton.<br />
<br />
The convolution results (feature maps) of the low resolution bank are upscaled and interleaved with those of high resolution bank. Then, these dense feature maps are processed through convolution stages at each pixel, which is equivalent to fully-connected network model but more efficient.<br />
<br />
Supervised training of the network is performed using batched Stochastic Gradient Descent (SGD) with Nesterov Momentum. They use a Mean Squared Error (MSE) criterion to minimize the distance between the predicted output and a target heat-map. At training time they also perform random perturbations of the input images (randomly flipping and scaling the images) to increase generalization performance.<br />
<br />
=== Higher-Level Spatial-Model ===<br />
<br />
They use a higher-level Spatial-Model to get rid of false positive outliers and anatomically incorrect poses predicted by the Part-Detector, constraining joint inter-connectivity and enforcing global pose consistency.<br />
<br />
They formulate the Spatial-Model as an MRF-like model over the distribution of spatial locations for each body part. After the unary potentials for each body part location are provided by the Part-Detector, the pair-wise potentials in the graph are computed using convolutional priors, which model the conditional distribution of the location of one body part to another. For instance, the final marginal likelihood for a body part A can be calculated as:<br />
<br />
<math>\bar{p}_{A}=\frac{1}{Z}\prod_{v\in V}^{ }\left ( p_{A|v}*p_{v}+b_{v\rightarrow A} \right )</math><br />
<br />
Where <math>v</math> is the joint location, <math>p_{A|v}</math> is the conditional prior which is the likelihood of the body part A occurring in pixel location (i, j) when joint <math>v</math> is located at the center pixel, <math>b_{v\rightarrow A}</math> is a bias term used to describe the background probability for the message from joint <math>v</math> to A, and Z is the partition function. The learned pair-wise distributions are purely uniform when any pairwise edge should to be removed from the graph structure.<br />
<br />
For their practical implementation they treat the distributions above as energies to avoid the evaluation of Z in the equation before. Their final model is<br />
<br />
<math>\bar{e}_{A}=\mathrm{exp}\left ( \sum_{v\in V}^{ }\left [ \mathrm{log}\left ( \mathrm{SoftPlus}\left ( e_{A|v} \right )*\mathrm{ReLU}\left ( e_{v} \right )+\mathrm{SoftPlus}\left ( b_{v\rightarrow A} \right ) \right ) \right ] \right )</math><br />
<br />
<math>\mathrm{where:SoftPlus}\left ( x \right )=\frac{1}{\beta }\mathrm{log}\left ( 1+\mathrm{exp}\left ( \beta x \right ) \right ), 0.5\leq \beta \leq 2</math><br />
<br />
<math>\mathrm{ReLU}\left ( x \right )=\mathrm{max}\left ( x,\epsilon \right ), 0< \epsilon \leq 0.01</math><br />
<br />
With this modified formulation, the equation can be trained by using back-propagation and SGD. The network-based implementation of the equation is shown below.<br />
<br />
[[File:architecture2.PNG | center]]<br />
<br />
The convolution kernels they use in this step is quite large, thus they apply FFT convolutions based on the GPU, which is introduced by Mathieu et al.<ref><br />
Mathieu M, Henaff M, LeCun Y.[http://arxiv.org/pdf/1312.5851.pdf Fast training of convolutional networks through ffts] arXiv preprint arXiv:1312.5851, 2013.<br />
</ref>.The convolution weights are initialized using the empirical histogram of joint displacements created from the training examples. Moreover, during training they randomly flip and scale the heat-map inputs to improve generalization performance.<br />
<br />
=== Unified Model ===<br />
<br />
They first train the Part-Detector separately and store the heat-map outputs, then use these heat-maps to train a Spatial-Model. Finally, they combine the trained Part-Detector and Spatial-Models and back-propagate through the entire network, which further improves performance.<br />
<br />
== Results ==<br />
<br />
They evaluated their architecture on the FLIC and extended-LSP datasets. The FLIC dataset is comprised of 5003 images from Hollywood movies with actors in predominantly front-facing standing up poses, while the extended-LSP dataset contains a wider variety of poses of athletes playing sport. They also proposed a new dataset called FLIC-plus<ref><br />
[http://cims.nyu.edu/~tompson/flic_plus.htm "FLIC-plus Dataset"]<br />
</ref> which is fairer than FLIC-full dataset.<br />
<br />
Their model’s performance on the FLIC test-set for the elbow and wrist joints is shown below. It’s trained by using both the FLIC and FLIC-plus training sets.<br />
<br />
[[File:result1.PNG | center]]<br />
<br />
Performance on the LSP dataset is shown here.<br />
<br />
[[File:result2.PNG | center]]<br />
<br />
Since the LSP dataset cover a larger range of the possible poses, their Spatial-Model is less effective. The accuracy for this dataset is lower than FLIC. They believe that increasing the size of the training set will improve performance for these difficult cases.</div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=joint_training_of_a_convolutional_network_and_a_graphical_model_for_human_pose_estimation&diff=26235joint training of a convolutional network and a graphical model for human pose estimation2015-11-14T02:43:42Z<p>X435liu: /* Results */</p>
<hr />
<div>== Introduction ==<br />
<br />
Human body pose estimation, or specifically the localization of human joints in monocular RGB images, remains a very challenging task in computer vision. Recent approaches to this problem fall into two broad categories: traditional deformable part models and deep-learning based discriminative models. Traditional models rely on the aggregation of hand-crafted low-level features and then use a standard classifier or a higher level generative model to detect the pose, which require the features to be sensitive enough and invariant to deformations. Deep learning approaches learn an empirical set of low and high-level features which are more tolerant to variations. However, it’s difficult to incorporate prior knowledge about the structure of the human body.<br />
<br />
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network Part-Detector and a part-based Spatial-Model. This combination and joint training significantly outperforms existing state-of-the-art models on the task of human body pose recognition.<br />
<br />
== Model ==<br />
=== Convolutional Network Part-Detector ===<br />
<br />
They combine an efficient sliding window-based architecture with multi-resolution and overlapping receptive fields, which is shown in the figure below.<br />
<br />
[[File:architecture1.PNG | center]]<br />
<br />
First, a Laplacian Pyramid<ref><br />
[https://en.wikipedia.org/wiki/Pyramid_(image_processing)#Gaussian_pyramid "Pyramid (image processing)"]<br />
</ref> of three resolution banks is used to provide each bank with non-overlapping spectral content. Then the Local Contrast Normalization (LCN<ref><br />
Collobert R, Kavukcuoglu K, Farabet C.[http://infoscience.epfl.ch/record/192376/files/Collobert_NIPSWORKSHOP_2011.pdf Torch7: A matlab-like environment for machine learning] BigLearn, NIPS Workshop. 2011 (EPFL-CONF-192376).<br />
</ref>) is applied to those input images. For each resolution bank, sliding-window ConvNet architecture with overlapping receptive fields is used to get a heat-map as output, which produces a per-pixel likelihood for key joint locations on the human skeleton.<br />
<br />
The convolution results (feature maps) of the low resolution bank are upscaled and interleaved with those of high resolution bank. Then, these dense feature maps are processed through convolution stages at each pixel, which is equivalent to fully-connected network model but more efficient.<br />
<br />
Supervised training of the network is performed using batched Stochastic Gradient Descent (SGD) with Nesterov Momentum. They use a Mean Squared Error (MSE) criterion to minimize the distance between the predicted output and a target heat-map. At training time they also perform random perturbations of the input images (randomly flipping and scaling the images) to increase generalization performance.<br />
<br />
=== Higher-Level Spatial-Model ===<br />
<br />
They use a higher-level Spatial-Model to get rid of false positive outliers and anatomically incorrect poses predicted by the Part-Detector, constraining joint inter-connectivity and enforcing global pose consistency.<br />
<br />
They formulate the Spatial-Model as an MRF-like model over the distribution of spatial locations for each body part. After the unary potentials for each body part location are provided by the Part-Detector, the pair-wise potentials in the graph are computed using convolutional priors, which model the conditional distribution of the location of one body part to another. For instance, the final marginal likelihood for a body part A can be calculated as:<br />
<br />
<math>\bar{p}_{A}=\frac{1}{Z}\prod_{v\in V}^{ }\left ( p_{A|v}*p_{v}+b_{v\rightarrow A} \right )</math><br />
<br />
Where <math>v</math> is the joint location, <math>p_{A|v}</math> is the conditional prior which is the likelihood of the body part A occurring in pixel location (i, j) when joint <math>v</math> is located at the center pixel, <math>b_{v\rightarrow A}</math> is a bias term used to describe the background probability for the message from joint <math>v</math> to A, and Z is the partition function. The learned pair-wise distributions are purely uniform when any pairwise edge should to be removed from the graph structure.<br />
<br />
For their practical implementation they treat the distributions above as energies to avoid the evaluation of Z in the equation before. Their final model is<br />
<br />
<math>\bar{e}_{A}=\mathrm{exp}\left ( \sum_{v\in V}^{ }\left [ \mathrm{log}\left ( \mathrm{SoftPlus}\left ( e_{A|v} \right )*\mathrm{ReLU}\left ( e_{v} \right )+\mathrm{SoftPlus}\left ( b_{v\rightarrow A} \right ) \right ) \right ] \right )</math><br />
<math>\mathrm{where:SoftPlus}\left ( x \right )=\frac{1}{\beta }\mathrm{log}\left ( 1+\mathrm{exp}\left ( \beta x \right ) \right ), 0.5\leq \beta \leq 2</math><br />
<math>\mathrm{ReLU}\left ( x \right )=\mathrm{max}\left ( x,\epsilon \right ), 0< \epsilon \leq 0.01</math><br />
<br />
With this modified formulation, the equation can be trained by using back-propagation and SGD. The network-based implementation of the equation is shown below.<br />
<br />
[[File:architecture2.PNG | center]]<br />
<br />
The convolution kernels they use in this step is quite large, thus they apply FFT convolutions based on the GPU, which is introduced by Mathieu et al.<ref><br />
Mathieu M, Henaff M, LeCun Y.[http://arxiv.org/pdf/1312.5851.pdf Fast training of convolutional networks through ffts] arXiv preprint arXiv:1312.5851, 2013.<br />
</ref>.The convolution weights are initialized using the empirical histogram of joint displacements created from the training examples. Moreover, during training they randomly flip and scale the heat-map inputs to improve generalization performance.<br />
<br />
=== Unified Model ===<br />
<br />
They first train the Part-Detector separately and store the heat-map outputs, then use these heat-maps to train a Spatial-Model. Finally, they combine the trained Part-Detector and Spatial-Models and back-propagate through the entire network, which further improves performance.<br />
<br />
== Results ==<br />
<br />
They evaluated their architecture on the FLIC and extended-LSP datasets. The FLIC dataset is comprised of 5003 images from Hollywood movies with actors in predominantly front-facing standing up poses, while the extended-LSP dataset contains a wider variety of poses of athletes playing sport. They also proposed a new dataset called FLIC-plus<ref><br />
[http://cims.nyu.edu/~tompson/flic_plus.htm "FLIC-plus Dataset"]<br />
</ref> which is fairer than FLIC-full dataset.<br />
<br />
Their model’s performance on the FLIC test-set for the elbow and wrist joints is shown below. It’s trained by using both the FLIC and FLIC-plus training sets.<br />
<br />
[[File:result1.PNG | center]]<br />
<br />
Performance on the LSP dataset is shown here.<br />
<br />
[[File:result2.PNG | center]]<br />
<br />
Since the LSP dataset cover a larger range of the possible poses, their Spatial-Model is less effective. The accuracy for this dataset is lower than FLIC. They believe that increasing the size of the training set will improve performance for these difficult cases.</div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Result2.PNG&diff=26234File:Result2.PNG2015-11-14T02:42:55Z<p>X435liu: </p>
<hr />
<div></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Result1.PNG&diff=26233File:Result1.PNG2015-11-14T02:37:05Z<p>X435liu: </p>
<hr />
<div></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=joint_training_of_a_convolutional_network_and_a_graphical_model_for_human_pose_estimation&diff=26232joint training of a convolutional network and a graphical model for human pose estimation2015-11-14T02:33:55Z<p>X435liu: /* Unified Model */</p>
<hr />
<div>== Introduction ==<br />
<br />
Human body pose estimation, or specifically the localization of human joints in monocular RGB images, remains a very challenging task in computer vision. Recent approaches to this problem fall into two broad categories: traditional deformable part models and deep-learning based discriminative models. Traditional models rely on the aggregation of hand-crafted low-level features and then use a standard classifier or a higher level generative model to detect the pose, which require the features to be sensitive enough and invariant to deformations. Deep learning approaches learn an empirical set of low and high-level features which are more tolerant to variations. However, it’s difficult to incorporate prior knowledge about the structure of the human body.<br />
<br />
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network Part-Detector and a part-based Spatial-Model. This combination and joint training significantly outperforms existing state-of-the-art models on the task of human body pose recognition.<br />
<br />
== Model ==<br />
=== Convolutional Network Part-Detector ===<br />
<br />
They combine an efficient sliding window-based architecture with multi-resolution and overlapping receptive fields, which is shown in the figure below.<br />
<br />
[[File:architecture1.PNG | center]]<br />
<br />
First, a Laplacian Pyramid<ref><br />
[https://en.wikipedia.org/wiki/Pyramid_(image_processing)#Gaussian_pyramid "Pyramid (image processing)"]<br />
</ref> of three resolution banks is used to provide each bank with non-overlapping spectral content. Then the Local Contrast Normalization (LCN<ref><br />
Collobert R, Kavukcuoglu K, Farabet C.[http://infoscience.epfl.ch/record/192376/files/Collobert_NIPSWORKSHOP_2011.pdf Torch7: A matlab-like environment for machine learning] BigLearn, NIPS Workshop. 2011 (EPFL-CONF-192376).<br />
</ref>) is applied to those input images. For each resolution bank, sliding-window ConvNet architecture with overlapping receptive fields is used to get a heat-map as output, which produces a per-pixel likelihood for key joint locations on the human skeleton.<br />
<br />
The convolution results (feature maps) of the low resolution bank are upscaled and interleaved with those of high resolution bank. Then, these dense feature maps are processed through convolution stages at each pixel, which is equivalent to fully-connected network model but more efficient.<br />
<br />
Supervised training of the network is performed using batched Stochastic Gradient Descent (SGD) with Nesterov Momentum. They use a Mean Squared Error (MSE) criterion to minimize the distance between the predicted output and a target heat-map. At training time they also perform random perturbations of the input images (randomly flipping and scaling the images) to increase generalization performance.<br />
<br />
=== Higher-Level Spatial-Model ===<br />
<br />
They use a higher-level Spatial-Model to get rid of false positive outliers and anatomically incorrect poses predicted by the Part-Detector, constraining joint inter-connectivity and enforcing global pose consistency.<br />
<br />
They formulate the Spatial-Model as an MRF-like model over the distribution of spatial locations for each body part. After the unary potentials for each body part location are provided by the Part-Detector, the pair-wise potentials in the graph are computed using convolutional priors, which model the conditional distribution of the location of one body part to another. For instance, the final marginal likelihood for a body part A can be calculated as:<br />
<br />
<math>\bar{p}_{A}=\frac{1}{Z}\prod_{v\in V}^{ }\left ( p_{A|v}*p_{v}+b_{v\rightarrow A} \right )</math><br />
<br />
Where <math>v</math> is the joint location, <math>p_{A|v}</math> is the conditional prior which is the likelihood of the body part A occurring in pixel location (i, j) when joint <math>v</math> is located at the center pixel, <math>b_{v\rightarrow A}</math> is a bias term used to describe the background probability for the message from joint <math>v</math> to A, and Z is the partition function. The learned pair-wise distributions are purely uniform when any pairwise edge should to be removed from the graph structure.<br />
<br />
For their practical implementation they treat the distributions above as energies to avoid the evaluation of Z in the equation before. Their final model is<br />
<br />
<math>\bar{e}_{A}=\mathrm{exp}\left ( \sum_{v\in V}^{ }\left [ \mathrm{log}\left ( \mathrm{SoftPlus}\left ( e_{A|v} \right )*\mathrm{ReLU}\left ( e_{v} \right )+\mathrm{SoftPlus}\left ( b_{v\rightarrow A} \right ) \right ) \right ] \right )</math><br />
<math>\mathrm{where:SoftPlus}\left ( x \right )=\frac{1}{\beta }\mathrm{log}\left ( 1+\mathrm{exp}\left ( \beta x \right ) \right ), 0.5\leq \beta \leq 2</math><br />
<math>\mathrm{ReLU}\left ( x \right )=\mathrm{max}\left ( x,\epsilon \right ), 0< \epsilon \leq 0.01</math><br />
<br />
With this modified formulation, the equation can be trained by using back-propagation and SGD. The network-based implementation of the equation is shown below.<br />
<br />
[[File:architecture2.PNG | center]]<br />
<br />
The convolution kernels they use in this step is quite large, thus they apply FFT convolutions based on the GPU, which is introduced by Mathieu et al.<ref><br />
Mathieu M, Henaff M, LeCun Y.[http://arxiv.org/pdf/1312.5851.pdf Fast training of convolutional networks through ffts] arXiv preprint arXiv:1312.5851, 2013.<br />
</ref>.The convolution weights are initialized using the empirical histogram of joint displacements created from the training examples. Moreover, during training they randomly flip and scale the heat-map inputs to improve generalization performance.<br />
<br />
=== Unified Model ===<br />
<br />
They first train the Part-Detector separately and store the heat-map outputs, then use these heat-maps to train a Spatial-Model. Finally, they combine the trained Part-Detector and Spatial-Models and back-propagate through the entire network, which further improves performance.<br />
<br />
== Results ==</div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=joint_training_of_a_convolutional_network_and_a_graphical_model_for_human_pose_estimation&diff=26231joint training of a convolutional network and a graphical model for human pose estimation2015-11-14T02:33:31Z<p>X435liu: /* Higher-Level Spatial-Model */</p>
<hr />
<div>== Introduction ==<br />
<br />
Human body pose estimation, or specifically the localization of human joints in monocular RGB images, remains a very challenging task in computer vision. Recent approaches to this problem fall into two broad categories: traditional deformable part models and deep-learning based discriminative models. Traditional models rely on the aggregation of hand-crafted low-level features and then use a standard classifier or a higher level generative model to detect the pose, which require the features to be sensitive enough and invariant to deformations. Deep learning approaches learn an empirical set of low and high-level features which are more tolerant to variations. However, it’s difficult to incorporate prior knowledge about the structure of the human body.<br />
<br />
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network Part-Detector and a part-based Spatial-Model. This combination and joint training significantly outperforms existing state-of-the-art models on the task of human body pose recognition.<br />
<br />
== Model ==<br />
=== Convolutional Network Part-Detector ===<br />
<br />
They combine an efficient sliding window-based architecture with multi-resolution and overlapping receptive fields, which is shown in the figure below.<br />
<br />
[[File:architecture1.PNG | center]]<br />
<br />
First, a Laplacian Pyramid<ref><br />
[https://en.wikipedia.org/wiki/Pyramid_(image_processing)#Gaussian_pyramid "Pyramid (image processing)"]<br />
</ref> of three resolution banks is used to provide each bank with non-overlapping spectral content. Then the Local Contrast Normalization (LCN<ref><br />
Collobert R, Kavukcuoglu K, Farabet C.[http://infoscience.epfl.ch/record/192376/files/Collobert_NIPSWORKSHOP_2011.pdf Torch7: A matlab-like environment for machine learning] BigLearn, NIPS Workshop. 2011 (EPFL-CONF-192376).<br />
</ref>) is applied to those input images. For each resolution bank, a sliding-window ConvNet architecture with overlapping receptive fields is used to produce a heat-map output giving a per-pixel likelihood for key joint locations on the human skeleton.<br />
<br />
The convolution results (feature maps) of the low-resolution banks are upscaled and interleaved with those of the high-resolution bank. These dense feature maps are then processed through convolution stages at each pixel, which is equivalent to applying a fully-connected network at each pixel but is more efficient.<br />
<br />
Supervised training of the network is performed using batched Stochastic Gradient Descent (SGD) with Nesterov Momentum. They use a Mean Squared Error (MSE) criterion to minimize the distance between the predicted output and a target heat-map. At training time they also perform random perturbations of the input images (randomly flipping and scaling the images) to increase generalization performance.<br />
<br />
=== Higher-Level Spatial-Model ===<br />
<br />
They use a higher-level Spatial-Model to remove false-positive outliers and anatomically incorrect poses predicted by the Part-Detector, constraining joint inter-connectivity and enforcing global pose consistency.<br />
<br />
They formulate the Spatial-Model as an MRF-like model over the distribution of spatial locations for each body part. After the unary potentials for each body part location are provided by the Part-Detector, the pair-wise potentials in the graph are computed using convolutional priors, which model the conditional distribution of the location of one body part to another. For instance, the final marginal likelihood for a body part A can be calculated as:<br />
<br />
<math>\bar{p}_{A}=\frac{1}{Z}\prod_{v\in V}^{ }\left ( p_{A|v}*p_{v}+b_{v\rightarrow A} \right )</math><br />
<br />
where <math>v</math> ranges over the joint locations, <math>p_{A|v}</math> is the conditional prior giving the likelihood of body part A occurring at pixel location (i, j) when joint <math>v</math> is located at the center pixel, <math>b_{v\rightarrow A}</math> is a bias term describing the background probability for the message from joint <math>v</math> to A, and Z is the partition function. A learned pair-wise distribution becomes purely uniform when the corresponding pairwise edge should be removed from the graph structure.<br />
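This product of convolved priors can be sketched in numpy; the part names, map sizes, and prior shape below are invented for illustration (the paper learns the priors from data):

```python
import numpy as np

def conv2d_same(prior, heat):
    # 'same'-size 2-D convolution (naive loops; fine for tiny maps).
    H, W = heat.shape
    kh, kw = prior.shape
    padded = np.pad(heat, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    k = prior[::-1, ::-1]                          # flip kernel: true convolution
    out = np.zeros_like(heat)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(k * padded[i:i + kh, j:j + kw])
    return out

# Hypothetical unary heat-maps p_v for two joints on a 5x5 image, and a
# conditional prior p_{A|v} saying "A tends to lie one pixel below v".
rng = np.random.default_rng(1)
heatmaps = {"shoulder": rng.random((5, 5)), "elbow": rng.random((5, 5))}
prior = np.zeros((3, 3)); prior[2, 1] = 1.0
bias = 1e-3                                        # background term b_{v -> A}

# \bar p_A = (1/Z) * prod_v ( p_{A|v} * p_v + b_{v -> A} )
p_A = np.ones((5, 5))
for p_v in heatmaps.values():
    p_A *= conv2d_same(prior, p_v) + bias
p_A /= p_A.sum()                                   # partition function Z
```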
<br />
For their practical implementation, they treat the distributions above as energies to avoid evaluating the partition function Z in the previous equation. Their final model is<br />
<br />
<math>\bar{e}_{A}=\mathrm{exp}\left ( \sum_{v\in V}^{ }\left [ \mathrm{log}\left ( \mathrm{SoftPlus}\left ( e_{A|v} \right )*\mathrm{ReLU}\left ( e_{v} \right )+\mathrm{SoftPlus}\left ( b_{v\rightarrow A} \right ) \right ) \right ] \right )</math><br />
where <math>\mathrm{SoftPlus}\left ( x \right )=\frac{1}{\beta }\mathrm{log}\left ( 1+\mathrm{exp}\left ( \beta x \right ) \right ),\ 0.5\leq \beta \leq 2</math><br />
and <math>\mathrm{ReLU}\left ( x \right )=\mathrm{max}\left ( x,\epsilon \right ),\ 0< \epsilon \leq 0.01</math>.<br />
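The two functions and the log-domain accumulation can be sketched in numpy; the energy values below are invented, and the paper's * is a convolution over heat-maps, reduced here to scalar multiplication for brevity:

```python
import numpy as np

def softplus(x, beta=1.0):
    # (1/beta) * log(1 + exp(beta * x)); the paper uses 0.5 <= beta <= 2
    return np.log1p(np.exp(beta * x)) / beta

def relu_eps(x, eps=0.01):
    # max(x, eps): keeps the term inside the log strictly positive
    return np.maximum(x, eps)

# Illustrative per-joint energies for two joints v (made-up numbers).
e_cond = np.array([0.3, -1.2])      # e_{A|v}
e_unary = np.array([2.0, 0.5])      # e_v
bias = np.array([-3.0, -3.0])       # b_{v -> A}

# \bar e_A = exp( sum_v log( SoftPlus(e_{A|v}) * ReLU(e_v) + SoftPlus(b) ) )
log_terms = np.log(softplus(e_cond) * relu_eps(e_unary) + softplus(bias))
e_A = np.exp(log_terms.sum())       # always > 0, and no Z needs evaluating
```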
<br />
With this modified formulation, the equation can be trained by using back-propagation and SGD. The network-based implementation of the equation is shown below.<br />
<br />
[[File:architecture2.PNG | center]]<br />
<br />
The convolution kernels used in this step are quite large, so they apply GPU-based FFT convolutions, as introduced by Mathieu et al.<ref><br />
Mathieu M, Henaff M, LeCun Y. [http://arxiv.org/pdf/1312.5851.pdf Fast training of convolutional networks through ffts] arXiv preprint arXiv:1312.5851, 2013.<br />
</ref> The convolution weights are initialized using the empirical histogram of joint displacements created from the training examples. Moreover, during training they randomly flip and scale the heat-map inputs to improve generalization performance.<br />
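FFT-based convolution zero-pads both arrays to the full output size and multiplies their spectra, which is what makes large kernels affordable. A minimal numpy sketch (the sizes are illustrative):

```python
import numpy as np

def fft_conv2d_full(a, k):
    # Full linear 2-D convolution computed in the Fourier domain.
    s0 = a.shape[0] + k.shape[0] - 1
    s1 = a.shape[1] + k.shape[1] - 1
    A = np.fft.rfft2(a, s=(s0, s1))      # zero-padded real FFTs
    K = np.fft.rfft2(k, s=(s0, s1))
    return np.fft.irfft2(A * K, s=(s0, s1))

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16))            # feature map
w = rng.normal(size=(9, 9))              # large kernel: FFT pays off here

y_fft = fft_conv2d_full(x, w)            # shape (24, 24)
```

The FFT route costs O(S log S) per map rather than O(S k^2) for direct convolution, so it wins precisely when the kernel is large, as in this Spatial-Model.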
<br />
=== Unified Model ===<br />
<br />
<br />
<br />
<br />
== Results ==</div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Architecture2.PNG&diff=26230File:Architecture2.PNG2015-11-14T02:29:58Z<p>X435liu: </p>
<hr />
<div></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Architecture1.PNG&diff=26228File:Architecture1.PNG2015-11-14T02:00:54Z<p>X435liu: </p>
<hr />
<div></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f15Stat946PaperSignUp&diff=26226f15Stat946PaperSignUp2015-11-14T01:53:24Z<p>X435liu: /* Set B */</p>
<hr />
<div> <br />
=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=<br />
<br />
= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=<br />
<br />
Use the following notations:<br />
<br />
S: You have written a summary on the paper<br />
<br />
T: You had technical contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
E: You had editorial contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]<br />
<br />
<br />
=Set A=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Oct 16 || pascal poupart || || Guest Lecturer||||<br />
|-<br />
|Oct 16 ||pascal poupart || ||Guest Lecturer ||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]<br />
|-<br />
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]<br />
|-<br />
|Makeup Class (TBA) || Peter Blouw|| ||Memory Networks|| [http://arxiv.org/abs/1410.3916 Paper]|| [[Memory Networks|Summary]]<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]<br />
|-<br />
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]<br />
|-<br />
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]<br />
|-<br />
|Nov 13 || Tim Tse || || Question Answering with Subgraph Embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] || [[Question Answering with Subgraph Embeddings | Summary ]]<br />
|-<br />
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]<br />
|-<br />
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]<br />
|-<br />
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]|| [[Natural language processing (almost) from scratch. | Summary]]<br />
|-<br />
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]<br />
|-<br />
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | Summary]]<br />
|-<br />
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/abs/10.1021/ci500747n.pdf Paper]||<br />
|-<br />
|Nov 27 || Derek Latremouille || ||The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]<br />
|-<br />
|Nov 27 ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models||||<br />
|-<br />
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]<br />
|-<br />
|Dec 4 || Fatemeh Karimi || ||MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||<br />
|-<br />
|Dec 4 || Jan Gosmann || || On the Number of Linear Regions of Deep Neural Networks || [http://arxiv.org/abs/1402.1869 Paper] || [[On the Number of Linear Regions of Deep Neural Networks | Summary]]<br />
|-<br />
|Dec 4 || Dylan Drover || || Towards AI-complete question answering: a set of prerequisite toy tasks || [http://arxiv.org/pdf/1502.05698.pdf Paper] ||<br />
|-<br />
|}<br />
|}<br />
<br />
=Set B=<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Anthony Caterini ||15 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]||<br />
|-<br />
|Jan Gosmann || || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]<br />
|-<br />
|Brent Komer || || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] ||<br />
|-<br />
|Sean Aubin || || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]<br />
|-<br />
|Peter Blouw|| || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] ||<br />
|-<br />
|Tim Tse|| || From Machine Learning to Machine Reasoning || [http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]<br />
|-<br />
|Rui Qiao|| || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]<br />
|-<br />
|Fatemeh Karimi|| 23 || Very Deep Convolutional Networks for Large-Scale Image Recognition || [http://arxiv.org/pdf/1409.1556.pdf Paper] || [[Very Deep Convoloutional Networks for Large-Scale Image Recognition|Summary]]<br />
|-<br />
|Amirreza Lashkari|| 43 || Distributed Representations of Words and Phrases and their Compositionality || [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Paper] || [[Distributed Representations of Words and Phrases and their Compositionality|Summary]]<br />
|-<br />
|Xinran Liu|| 19 || Joint training of a convolutional network and a graphical model for human pose estimation || [http://papers.nips.cc/paper/5573-joint-training-of-a-convolutional-network-and-a-graphical-model-for-human-pose-estimation.pdf Paper] || [[Joint training of a convolutional network and a graphical model for human pose estimation|Summary]]<br />
|-<br />
|Chris Choi|| || Learning Long-Range Vision for Autonomous Off-Road Driving || [http://yann.lecun.com/exdb/publis/pdf/hadsell-jfr-09.pdf Paper] || [[Learning Long-Range Vision for Autonomous Off-Road Driving|Summary]]<br />
|-<br />
|Luyao Ruan|| || Deep Learning of the tissue-regulated splicing code || [http://bioinformatics.oxfordjournals.org/content/30/12/i121.full.pdf+html Paper] || [[Deep Learning of the tissue-regulated splicing code| Summary]]<br />
|-<br />
|Abdullah Rashwan|| || Deep Convolutional Neural Networks For LVCSR || [http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf paper] || [[Deep Convolutional Neural Networks For LVCSR| Summary]]<br />
|-<br />
|Mahmood Gohari||37 || On using very large target vocabulary for neural machine translation || [http://arxiv.org/pdf/1412.2007v2.pdf paper] || [[On using very large target vocabulary for neural machine translation| Summary]]<br />
|-<br />
|Valerie Platsko|| || Learning Convolutional Feature Hierarchies for Visual Recognition || [http://papers.nips.cc/paper/4133-learning-convolutional-feature-hierarchies-for-visual-recognition Paper] || [[Learning Convolutional Feature Hierarchies for Visual Recognition | Summary]]</div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=imageNet_Classification_with_Deep_Convolutional_Neural_Networks&diff=26115imageNet Classification with Deep Convolutional Neural Networks2015-11-11T20:44:18Z<p>X435liu: /* Overall Architecture */</p>
<hr />
<div>== Introduction ==<br />
<br />
In this paper, they trained a large, deep neural network to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest into 1000 different classes. To learn about thousands of objects from millions of images, a Convolutional Neural Network (CNN) is used, owing to its large learning capacity, its comparatively few connections and parameters, and its strong performance on image classification.<br />
<br />
Moreover, current GPUs provide a powerful tool to facilitate the training of interestingly large CNNs. Thus, they trained one of the largest convolutional neural networks to date on the ILSVRC-2010 and ILSVRC-2012 datasets and achieved the best results reported on these datasets up to the time the paper was written.<br />
<br />
The code of their work is available here<ref><br />
[http://code.google.com/p/cuda-convnet/ "High-performance C++/CUDA implementation of convolutional neural networks"]<br />
</ref>.<br />
<br />
== Dataset ==<br />
<br />
ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has roughly 1.2 million labeled high-resolution training images, 50 thousand validation images, and 150 thousand testing images over 1000 categories.<br />
<br />
In this paper, the images in this dataset are down-sampled to a fixed resolution of 256 x 256. The only image pre-processing they used is subtracting the mean activity over the training set from each pixel.<br />
<br />
== Architecture ==<br />
<br />
=== ReLU Nonlinearity ===<br />
<br />
They use Rectified Linear Units (ReLUs)<ref><br />
Nair V, Hinton G E. [http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_NairH10.pdf Rectified linear units improve restricted boltzmann machines.] Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010: 807-814.<br />
</ref> as the nonlinearity; networks with ReLUs train several times faster than equivalent networks with standard saturating neurons such as tanh units. The shorter training time per epoch makes it practical to train on larger datasets, which helps prevent overfitting.<br />
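One rough way to see why (an illustration, not from the paper): the gradient of a saturating unit like tanh vanishes for large inputs, while ReLU's gradient stays at exactly 1 for any positive input, so gradient signal does not die during training.

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 7)
relu = np.maximum(x, 0.0)          # ReLU(x) = max(0, x)

g_tanh = 1.0 - np.tanh(x) ** 2     # tanh gradient: ~0 for large |x|
g_relu = (x > 0).astype(float)     # ReLU gradient: 1 wherever x > 0
```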
<br />
=== Training on Multiple GPUs ===<br />
<br />
They spread the net across two GPUs by putting half of the kernels (or neurons) on each GPU and letting GPUs communicate only in certain layers. Choosing the pattern of connectivity could be a problem for cross-validation, so they tune the amount of communication precisely until it is an acceptable fraction of the amount of computation.<br />
<br />
=== Local Response Normalization ===<br />
<br />
ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. However, they find that a local response normalization scheme after applying the ReLU nonlinearity can reduce their top-1 and top-5 error rates by 1.4% and 1.2%.<br />
<br />
The response normalization is given by the expression<br />
<br />
<math>b_{x,y}^{i}=a_{x,y}^{i}/\left ( k+\alpha \sum_{j=max\left ( 0,i-n/2 \right )}^{min\left ( N-1,i+n/2 \right )}\left ( a_{x,y}^{j} \right )^{2} \right )^{\beta }</math><br />
<br />
where the sum runs over n “adjacent” kernel maps at the same spatial position. This response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels.<br />
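This expression transcribes directly into numpy; the hyperparameter values k = 2, n = 5, alpha = 1e-4, beta = 0.75 are those reported in the paper:

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a: activations of shape (N, H, W), one map per kernel. Each map i is
    # divided by a power of the summed squares over n adjacent maps j.
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b
```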
<br />
=== Overlapping Pooling ===<br />
<br />
Unlike traditional non-overlapping pooling, they use overlapping pooling throughout their network, with pooling window size z = 3 and stride s = 2. This scheme reduces their top-1 and top-5 error rates by 0.4% and 0.3% respectively, and it makes the network slightly more difficult to overfit.<br />
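A minimal sketch of such a pooling layer: with z = 3 and s = 2, each 3 x 3 window overlaps its neighbours by one row or column (z > s), whereas traditional pooling uses z = s.

```python
import numpy as np

def max_pool2d(a, z=3, s=2):
    # z x z max-pooling with stride s; z > s makes the windows overlap.
    H, W = a.shape
    out_h, out_w = (H - z) // s + 1, (W - z) // s + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = a[i * s:i * s + z, j * s:j * s + z].max()
    return out
```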
<br />
=== Overall Architecture ===<br />
<br />
[[File:network.JPG | center]]<br />
<br />
As shown in the figure above, the net contains eight layers with 60 million parameters; the first five are convolutional and the remaining three are fully connected layers. The output of the last layer is fed to a 1000-way softmax. Their network maximizes the average across training cases of the log-probability of the correct label under the prediction distribution.<br />
<br />
Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.<br />
<br />
== Reducing overfitting ==<br />
<br />
=== Data Augmentation ===<br />
<br />
The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations. In this paper, the transformed images are generated on the CPU while the GPU is training, so they do not need to be stored on disk.<br />
<br />
They extract random 224 x 224 patches (and their horizontal reflections) from the 256 x 256 images and train the network on these extracted patches. They also perform principal component analysis (PCA) on the set of RGB pixel values over the training set and add multiples of the principal components, with magnitudes proportional to the corresponding eigenvalues, to each training image. This scheme captures the fact that object identity is invariant to changes in the intensity and color of the illumination, and it reduces the top-1 error rate by over 1%.<br />
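The PCA-based color augmentation can be sketched as follows. The image size and the 0.1 standard deviation of the random multiples follow the paper; the pixel data here is random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.random((10000, 3))        # stand-in for all training RGB values

# PCA on the 3x3 covariance of RGB values across the training set.
cov = np.cov((pixels - pixels.mean(axis=0)).T)
eigval, eigvec = np.linalg.eigh(cov)

def color_jitter(image):
    # Add eigvec @ (alpha * eigval) with alpha ~ N(0, 0.1) to every pixel:
    # the whole image receives one consistent illumination/color shift.
    alpha = rng.normal(0.0, 0.1, size=3)
    return image + eigvec @ (alpha * eigval)

img = rng.random((224, 224, 3))
aug = color_jitter(img)
```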
<br />
=== Dropout ===<br />
<br />
The “dropout” technique is applied in the first two fully-connected layers by setting the output of each hidden neuron to zero with probability 0.5. This scheme roughly doubles the number of iterations required to converge, but it forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.<br />
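A minimal sketch of the scheme; at test time the paper uses all neurons but multiplies their outputs by 0.5 to compensate:

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    if train:
        # Zero each hidden unit independently with probability p.
        return h * (rng.random(h.shape) >= p)
    # Test time: keep every unit but scale outputs by (1 - p).
    return h * (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(10000)
h_train = dropout(h, 0.5, train=True, rng=rng)   # ~half the units zeroed
h_test = dropout(h, 0.5, train=False)            # all units, scaled by 0.5
```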
<br />
== Details of learning ==<br />
<br />
They trained the network using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The update rule for weight w was<br />
<br />
<math>v_{i+1}:=0.9\cdot v_{i}-0.0005\cdot \epsilon \cdot w_{i}-\epsilon \cdot \left \langle \frac{\partial L}{\partial w}|_{w_{i}} \right \rangle_{D_{i}}</math><br />
<br />
<math>w_{i+1}:=w_{i}+v_{i+1}</math><br />
<br />
where <math>v</math> is the momentum variable and <math>\epsilon</math> is the learning rate, which is adjusted manually throughout training. The weights in each layer are initialized from a zero-mean Gaussian distribution with standard deviation 0.01. The biases in the second, fourth, and fifth convolutional layers and in the fully-connected hidden layers are initialized to 1, while those in the remaining layers are initialized to 0.<br />
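The update rule transcribes directly into code; as a toy check (not from the paper) it is applied here to the quadratic f(w) = ||w||^2 / 2, whose gradient is w itself:

```python
import numpy as np

lr, momentum, weight_decay = 0.01, 0.9, 0.0005   # values from the paper

def sgd_step(w, v, grad):
    # v_{i+1} = 0.9 * v_i - 0.0005 * eps * w_i - eps * grad
    # w_{i+1} = w_i + v_{i+1}
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(300):
    w, v = sgd_step(w, v, w)     # gradient of ||w||^2 / 2 is w
```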
<br />
== Results ==<br />
<br />
For the ILSVRC-2010 dataset, their network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%, which was the state of the art at that time.<br />
<br />
For the ILSVRC-2012 dataset, the CNN described in this paper achieves a top-5 error rate of 18.2%. Averaging the predictions of five similar CNNs gives an error rate of 16.4%.<br />
<br />
== Discussion ==<br />
<br />
1. It is notable that the network’s performance degrades if a single convolutional layer is removed, so the depth of the network is important for achieving their results.<br />
<br />
2. Their experiments suggest that the results can be improved simply by waiting for faster GPUs and bigger datasets to become available.<br />
<br />
== Bibliography ==<br />
<references /></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Network.JPG&diff=26114File:Network.JPG2015-11-11T20:43:59Z<p>X435liu: uploaded a new version of &quot;File:Network.JPG&quot;</p>
<hr />
<div></div>X435liuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=imageNet_Classification_with_Deep_Convolutional_Neural_Networks&diff=26113imageNet Classification with Deep Convolutional Neural Networks2015-11-11T20:38:14Z<p>X435liu: </p>
<hr />
<div>== Introduction ==<br />
<br />
In this paper, they trained a large, deep neural network to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest into 1000 different classes. To learn about thousands of objects from millions of images, a Convolutional Neural Network (CNN) is used, owing to its large learning capacity, its comparatively few connections and parameters, and its strong performance on image classification.<br />
<br />
Moreover, current GPUs provide a powerful tool to facilitate the training of interestingly large CNNs. Thus, they trained one of the largest convolutional neural networks to date on the ILSVRC-2010 and ILSVRC-2012 datasets and achieved the best results reported on these datasets up to the time the paper was written.<br />
<br />
The code of their work is available here<ref><br />
[http://code.google.com/p/cuda-convnet/ "High-performance C++/CUDA implementation of convolutional neural networks"]<br />
</ref>.<br />
<br />
== Dataset ==<br />
<br />
ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has roughly 1.2 million labeled high-resolution training images, 50 thousand validation images, and 150 thousand testing images over 1000 categories.<br />
<br />
In this paper, the images in this dataset are down-sampled to a fixed resolution of 256 x 256. The only image pre-processing they used is subtracting the mean activity over the training set from each pixel.<br />
<br />
== Architecture ==<br />
<br />
=== ReLU Nonlinearity ===<br />
<br />
They use Rectified Linear Units (ReLUs)<ref><br />
Nair V, Hinton G E. [http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_NairH10.pdf Rectified linear units improve restricted boltzmann machines.] Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010: 807-814.<br />
</ref> as the nonlinearity; networks with ReLUs train several times faster than equivalent networks with standard saturating neurons such as tanh units. The shorter training time per epoch makes it practical to train on larger datasets, which helps prevent overfitting.<br />
<br />
=== Training on Multiple GPUs ===<br />
<br />
They spread the net across two GPUs by putting half of the kernels (or neurons) on each GPU and letting GPUs communicate only in certain layers. Choosing the pattern of connectivity could be a problem for cross-validation, so they tune the amount of communication precisely until it is an acceptable fraction of the amount of computation.<br />
<br />
=== Local Response Normalization ===<br />
<br />
ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. However, they find that a local response normalization scheme after applying the ReLU nonlinearity can reduce their top-1 and top-5 error rates by 1.4% and 1.2%.<br />
<br />
The response normalization is given by the expression<br />
<br />
<math>b_{x,y}^{i}=a_{x,y}^{i}/\left ( k+\alpha \sum_{j=\max\left ( 0,i-n/2 \right )}^{\min\left ( N-1,i+n/2 \right )}\left ( a_{x,y}^{j} \right )^{2} \right )^{\beta }</math><br />
<br />
where <math>a_{x,y}^{i}</math> is the activity of the neuron computed by kernel <math>i</math> at position <math>(x,y)</math>, the sum runs over <math>n</math> “adjacent” kernel maps at the same spatial position, and <math>N</math> is the total number of kernels in the layer. This response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels.<br />
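The normalization above can be sketched in a few lines of NumPy. The constants <math>k=2</math>, <math>n=5</math>, <math>\alpha=10^{-4}</math>, <math>\beta=0.75</math> are the values reported in the paper; the function name and the `(N, H, W)` layout are my own choices for illustration:<br />

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize activities across kernel maps.

    a has shape (N, H, W): N kernel maps of spatial size H x W.
    Each activity is divided by a term summing the squared
    activities of n "adjacent" kernel maps at the same position.
    """
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b
```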
<br />
=== Overlapping Pooling ===<br />
<br />
Unlike traditional non-overlapping pooling, they use overlapping pooling throughout their network, with pooling window size z = 3 and stride s = 2 (the windows overlap because s &lt; z). This scheme reduces their top-1 and top-5 error rates by 0.4% and 0.3%, respectively, and makes the network slightly more difficult to overfit.<br />
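A minimal sketch of overlapping max pooling on a single 2-D map, with the paper's z = 3, s = 2 as defaults (the function name is my own; real implementations pool each kernel map on the GPU):<br />

```python
import numpy as np

def max_pool(x, z=3, s=2):
    """Max pooling over a 2-D map with window size z and stride s.

    With s < z (here s=2 < z=3), adjacent pooling windows overlap.
    """
    H, W = x.shape
    out_h = (H - z) // s + 1
    out_w = (W - z) // s + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + z, j * s:j * s + z].max()
    return out
```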
<br />
=== Overall Architecture ===<br />
<br />
<gallery><br />
Image:Network.jpg|The network architecture<br />
</gallery><br />
<br />
As shown in the figure above, the net contains eight layers with 60 million parameters; the first five are convolutional and the remaining three are fully connected layers. The output of the last layer is fed to a 1000-way softmax. Their network maximizes the average across training cases of the log-probability of the correct label under the prediction distribution.<br />
<br />
Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.<br />
<br />
== Reducing overfitting ==<br />
<br />
=== Data Augmentation ===<br />
<br />
The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations. In this paper, the transformed images are generated on the CPU while the GPU is training, so they do not need to be stored on disk.<br />
<br />
They extract random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and train the network on these extracted patches. They also perform principal component analysis (PCA) on the set of RGB pixel values and add multiples of the principal components to each training image. This scheme approximately captures the invariance of object identity to changes in the intensity and colour of the illumination, and reduces the top-1 error rate by over 1%.<br />
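The patch-extraction part of this augmentation can be sketched as follows (an illustrative NumPy sketch with a hypothetical function name, not the authors' code; the PCA colour perturbation is omitted):<br />

```python
import numpy as np

def random_patch(img, size=224, rng=None):
    """Extract a random size x size patch from an (H, W, 3) image and,
    with probability 0.5, flip it horizontally -- the label-preserving
    transformations described above."""
    if rng is None:
        rng = np.random.default_rng()
    H, W, _ = img.shape
    top = rng.integers(0, H - size + 1)
    left = rng.integers(0, W - size + 1)
    patch = img[top:top + size, left:left + size]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal reflection
    return patch
```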
<br />
=== Dropout ===<br />
<br />
The “dropout” technique is implemented in the first two fully-connected layers by setting to zero the output of each hidden neuron with probability 0.5. At test time, all neurons are used but their outputs are multiplied by 0.5. This scheme roughly doubles the number of iterations required to converge. However, it forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.<br />
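A minimal sketch of this dropout scheme (drop with probability p at training time, scale outputs by 1 − p at test time; function name and signature are my own):<br />

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    """At training time, zero each unit's output with probability p.
    At test time, keep all units but scale outputs by (1 - p) so the
    expected activity matches training."""
    if not train:
        return h * (1.0 - p)
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) >= p
    return h * mask
```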
<br />
== Details of learning ==<br />
<br />
They trained the network using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The update rule for weight w was<br />
<br />
<math>v_{i+1}:=0.9\cdot v_{i}-0.0005\cdot \epsilon \cdot w_{i}-\epsilon \cdot \left \langle \frac{\partial L}{\partial w}|_{w_{i}} \right \rangle_{D_{i}}</math><br />
<br />
<math>w_{i+1}:=w_{i}+v_{i+1}</math><br />
<br />
where <math>v</math> is the momentum variable and <math>\epsilon</math> is the learning rate, which is adjusted manually throughout training. The weights in each layer are initialized from a zero-mean Gaussian distribution with standard deviation 0.01. The biases in the second, fourth, and fifth convolutional layers and in the fully-connected hidden layers are initialized to 1, while those in the remaining layers are set to 0.<br />
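One step of the update rule above, written out in NumPy (a sketch with a hypothetical function name; the gradient is assumed to be the batch-averaged derivative <math>\left \langle \partial L/\partial w \right \rangle_{D_i}</math>):<br />

```python
import numpy as np

def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update of the paper's rule:
    v <- momentum*v - weight_decay*lr*w - lr*grad;  w <- w + v."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v
```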
<br />
== Results ==<br />
<br />
For the ILSVRC-2010 dataset, their network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%, which was the state of the art at the time.<br />
<br />
For the ILSVRC-2012 dataset, the CNN described in this paper achieves a top-5 error rate of 18.2%. Averaging the predictions of five similar CNNs gives an error rate of 16.4%.<br />
<br />
== Discussion ==<br />
<br />
1. It is notable that the network’s performance degrades if a single convolutional layer is removed, so the depth of the network is important for achieving these results.<br />
<br />
2. Their experiments suggest that the results can be improved simply by waiting for faster GPUs and bigger datasets to become available.<br />
<br />
== Bibliography ==<br />
<references /></div>