statwiki, User contributions [US] feed (Conversion script, MediaWiki 1.28.3; feed generated 2023-02-05)
semi-supervised Learning with Deep Generative Models (2017-08-30). Conversion script: page title converted to lowercase.
<hr />
<div>= Introduction =<br />
<br />
Large labelled data sets have led to massive improvements in the performance of machine learning algorithms, especially supervised neural networks. However, the world in general is not labelled, and there exists far more unlabelled data than labelled data. A common situation is to have a comparatively small quantity of labelled data paired with a larger amount of unlabelled data. This leads to the idea of a semi-supervised learning model, where the unlabelled data is used to prime the model for relevant features and the labels are then learned for classification. A prominent example of this type of model is the restricted Boltzmann machine (RBM) based Deep Belief Network (DBN), where layers of RBMs are trained to learn unsupervised features of the data and a final classification layer is then applied so that labels can be assigned. <br />
Unsupervised learning techniques sometimes produce what is known as a generative model, which captures a joint distribution <math>P(x, y)</math> (and can therefore be sampled from). This is contrasted with the supervised discriminative model, which captures only a conditional distribution <math>P(y | x)</math>. The paper combines these two approaches to achieve high performance on benchmark tasks and uses deep neural networks in an innovative manner to create a layered semi-supervised classification/generation model.<br />
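As a toy illustration of the joint-versus-conditional distinction (the discrete distribution below is made up, not from the paper), a generative model that stores <math>P(x, y)</math> can both be sampled from and always yields <math>P(y|x)</math> by normalizing a row:<br />

```python
import numpy as np

# Made-up joint P(x, y): rows index x in {0,1,2}, columns index y in {0,1}.
joint = np.array([[0.20, 0.05],   # P(x=0, y=0), P(x=0, y=1)
                  [0.10, 0.25],
                  [0.05, 0.35]])

def conditional_y_given_x(joint, x):
    """P(y | x) = P(x, y) / sum_y P(x, y)."""
    row = joint[x]
    return row / row.sum()

def sample_joint(joint, rng):
    """Draw (x, y) ~ P(x, y): only a generative model supports this."""
    idx = rng.choice(joint.size, p=joint.ravel())
    return divmod(idx, joint.shape[1])

print(conditional_y_given_x(joint, 0))   # → [0.8 0.2]
```

A discriminative model would store only the three conditionals and could not generate (x, y) pairs.<br />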
<br />
= Current Models and Limitations =<br />
<br />
The paper claims that existing unlabelled-data models do not scale well to very large sets of unlabelled data. One example that they discuss is the Transductive SVM, which they claim does not scale well and whose optimization is problematic. Graph-based models suffer from sensitivity to their graphical structure, which can make them rigid. Finally, they briefly discuss other neural-network-based methods such as the Manifold Tangent Classifier, which uses contractive auto-encoders (CAEs) to deduce the manifold on which the data lies. Based on the manifold hypothesis, similar data should not lie far from the manifold, and a method called TangentProp can then be used to train a classifier based on the manifold of the data. <br />
<br />
= Proposed Method =<br />
<br />
Rather than using the methods mentioned above, the authors suggest that generative models based on neural networks would be beneficial. Current generative models, however, lack strong inference and scalability. The paper proposes a method for semi-supervised classification that uses variational inference and employs deep neural networks.<br />
<br />
== Latent Feature Discriminative Model (M1) ==<br />
<br />
The first sub-model that is described is used to model latent variables ('''z''') that embed features of the unlabelled data. Classification for this model is done separately based on the learned features from the unlabelled data. The key to this model is that the non-linear transform to capture features is a deep neural network. The generative model is based on the following equations: <br />
<br />
<div style="text-align: center;"><br />
<math>p(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\mathbf{0,I})</math> <br />
<br />
<math>p(\mathbf{x|z}) = f(\mathbf{x};\mathbf{z,\theta})</math><br />
</div><br />
<br />
Here f is a likelihood function whose parameters <math>\theta</math> are produced by a deep neural network. The posterior distribution <math>p(\mathbf{z}|\mathbf{x})</math> is sampled to train an arbitrary classifier for class labels <math>y</math>. This approach offers a substantial improvement over the performance of SVMs.<br />
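A minimal sketch of the M1 pipeline follows. The data, the encoder weights, and the nearest-centroid classifier are all made-up stand-ins (the paper uses a trained deep network for <math>q(\mathbf{z}|\mathbf{x})</math> and an SVM-style classifier); the point is only the two-stage structure, features first, labels second:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the deep network behind q(z | x): an untrained nonlinear map.
def encoder(x, W):
    return np.tanh(x @ W)

# Toy data: two blobs in 20 dimensions; labels are known for only 20 points.
X = np.vstack([rng.normal(-1.0, 1.0, (100, 20)),
               rng.normal(+1.0, 1.0, (100, 20))])
y = np.array([0] * 100 + [1] * 100)
W = rng.normal(0.0, 0.3, (20, 5))            # placeholder encoder weights

Z = encoder(X, W)                             # latent features for all data
labelled = rng.choice(200, 20, replace=False) # the small labelled subset

# Nearest-centroid classifier on z, fit using the labelled subset only.
mu0 = Z[labelled][y[labelled] == 0].mean(axis=0)
mu1 = Z[labelled][y[labelled] == 1].mean(axis=0)
pred = (np.linalg.norm(Z - mu1, axis=1) <
        np.linalg.norm(Z - mu0, axis=1)).astype(int)
print("accuracy on all 200 points:", (pred == y).mean())
```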
<br />
== Generative Semi-Supervised Model (M2) ==<br />
<br />
The second model also uses a latent variable '''z''', but the class label <math>y</math> is treated as an additional latent variable: when a label is available it is observed during training, and when it is missing it is marginalized out. The following equations describe the generative process, where <math>Cat(y|\mathbf{\pi})</math> is a multinomial distribution and f plays the same role as in M1 but takes the extra input <math>y</math>. Classification of an unlabelled sample is treated as inference over the missing class label, usually via the posterior <math>p_{\theta}(y|\mathbf{x})</math>.<br />
<br />
<br />
<div style="text-align: center;"><br />
<math>p(y) = Cat(y|\mathbf{\pi})</math><br />
<br />
<math>p(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\mathbf{0,I})</math><br />
<br />
<math>p_{\theta}(\mathbf{x}|y, \mathbf{z}) = f(\mathbf{x};y,\mathbf{z,\theta})</math><br />
</div><br />
Another way to see this model is as a hybrid continuous-discrete mixture model, where the parameters are shared between the different components of the mixture.<br />
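This generative process can be sketched numerically. The decoder and its parameters below are arbitrary stand-ins for the trained network f, and the prior <math>\mathbf{\pi}</math> is assumed for illustration; note how the latent-to-x map B is shared across the discrete mixture components:<br />

```python
import numpy as np

rng = np.random.default_rng(1)

pi = np.array([0.3, 0.7])             # class prior Cat(y | pi), assumed
A = rng.normal(size=(2, 4))           # per-class offsets (hypothetical)
B = rng.normal(size=(3, 4))           # shared latent-to-x map (hypothetical)

def sample_x(rng):
    y = rng.choice(len(pi), p=pi)     # y ~ Cat(pi): pick a mixture component
    z = rng.normal(size=3)            # z ~ N(0, I): continuous latent code
    mean = A[y] + z @ B               # stand-in for f(x; y, z, theta)
    return y, mean + 0.1 * rng.normal(size=4)

samples = [sample_x(rng) for _ in range(1000)]
print("empirical P(y=1):", np.mean([y for y, _ in samples]))
```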
<br />
== Stacked Generative Semi-Supervised Model (M1+M2) == <br />
<br />
The two aforementioned models are stacked to form the final model: M1 is learned first, and M2 is then trained using the latent variables <math>\mathbf{z}_1</math> from M1 as its data in place of the raw values <math>\mathbf{x}</math>. The following equation describes the entire model. The distributions <math>p_{\theta}(\mathbf{z}_1|y,\mathbf{z}_2)</math> and <math>p_{\theta}(\mathbf{x}|\mathbf{z}_1)</math> are parametrized as deep neural networks. <br />
<br />
<div style="text-align: center;"><br />
<math>p_{\theta}(\mathbf{x}, y, \mathbf{z}_1, \mathbf{z}_2) = p(y)p(\mathbf{z}_2)p_{\theta}(\mathbf{z}_1|y, \mathbf{z}_2)p_{\theta}(\mathbf{x}|\mathbf{z}_1)</math><br />
<br />
<br />
</div><br />
<br />
The problem of intractable posterior distributions is solved using the variational inference framework of Kingma and Welling. The inference networks are not described in detail in the paper. The following algorithms show how the optimization for the models is performed. <br />
<br />
<br />
[[File:Kingma_2014_1.png |centre|thumb|upright=3|]]<br />
<br />
The posterior distributions are, as usual, intractable, but this problem is resolved through the use of a fixed-form distribution <math>q_{\phi}(\mathbf{z|x})</math>, with parameters <math>\phi</math>, that approximates <math>p(\mathbf{z|x})</math>. The distribution <math>q_{\phi}</math> is constructed as an inference network, which allows a single set of global parameters to be learned rather than requiring a separate computation for each individual data point.<br />
<br />
[[File:kingma_2014_4.png |centre|]]<br />
<br />
In the equations, <math>\,\sigma_{\phi}(x)</math> is a vector of standard deviations, <math>\,\pi_{\theta}(x)</math> is a probability vector, and the functions <math>\,\mu_{\phi}(x), \sigma_{\phi}(x) </math> and <math> \,\pi_{\theta}(x)</math> are represented as MLPs whose parameters are optimized.<br />
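The amortized inference idea can be sketched as follows. The weights below are random placeholders rather than trained values; the point is that one network maps any <math>x</math> to the Gaussian parameters <math>(\mu, \sigma)</math> of <math>q_{\phi}(\mathbf{z|x})</math>, and sampling uses the reparameterization <math>z = \mu + \sigma \epsilon</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights for a 5-d input and a 2-d latent space.
W_mu = rng.normal(0.0, 0.5, (5, 2))
W_logvar = rng.normal(0.0, 0.5, (5, 2))

def encode(x):
    h = np.tanh(x)                       # shared feature layer (illustrative)
    return h @ W_mu, np.exp(0.5 * (h @ W_logvar))   # mu(x), sigma(x) > 0

def sample_z(x, rng):
    mu, sigma = encode(x)
    eps = rng.normal(size=mu.shape)      # reparameterization: z = mu + sigma*eps
    return mu + sigma * eps

x = rng.normal(size=5)
mu, sigma = encode(x)
print(sample_z(x, rng).shape, (sigma > 0).all())
```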
<br />
The above algorithm is no more computationally expensive than approaches based on autoencoders or neural models, and it has the advantage of being fully probabilistic. The complexity of a single joint update of M<sub>1</sub> can be written as C<sub>M1</sub> = MSC<sub>MLP</sub>, where M is the batch size, S is the number of samples of ε, and C<sub>MLP</sub> has the form O(KD<sup>2</sup>), with K the number of layers in the model and D the average dimension of the layers. The complexity for M<sub>2</sub> has the form LC<sub>M1</sub>, where L is the number of labels. All of the above models can be trained with the EM algorithm, stochastic gradient variational Bayes, or stochastic backpropagation.<br />
<br />
= Results =<br />
<br />
The complexity of M1 can be estimated from the complexity of the MLP used for the parameters, <math>C_{MLP} = O(KD^2)</math>, where K is the number of layers and D is the average number of neurons per layer of the network. The total complexity is <math>C_{M1}=MSC_{MLP}</math>, where M is the mini-batch size and S is the number of samples. Similarly, the complexity of M2 is <math>C_{M2}=LC_{M1}</math>, where L is the number of labels, so the combined complexity of the stacked model is simply the sum of these two. This matches the lowest complexities of similar models; however, this approach achieves better results, as seen in the following table.<br />
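Plugging hypothetical numbers into these formulas (the values of K, D, M, S, and L below are assumptions for illustration, not the paper's settings):<br />

```python
# C_MLP = O(K * D^2), C_M1 = M * S * C_MLP, C_M2 = L * C_M1.
K, D = 3, 500         # layers and average layer width (assumed)
M, S, L = 100, 1, 10  # minibatch size, samples of eps, number of labels

C_MLP = K * D * D
C_M1 = M * S * C_MLP
C_M2 = L * C_M1
print(C_MLP, C_M1, C_M2)
```

The L factor in C<sub>M2</sub> comes from marginalizing over the label when it is missing, one evaluation per class.<br />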
<br />
The results are better across all labelled-set sizes for the M1+M2 model, and drastically better when the number of labelled samples is very small (100 out of 50000). <br />
<br />
[[File:kingma_2014_2.png | centre|]]<br />
<br />
The following figure demonstrates the model's ability to generate images through conditional generation. The class label was fixed and then the latent variables, '''z''', were altered. The figure shows how the latent variables were varied and how the generated digits are similar for similar values of '''z'''s. Parts b and c of the figure use a test image to generate images that belong to a similar set of '''z''' values (images that are similar). <br />
<br />
[[File:kingma_2014_3.png |thumb|upright=3|centre|]]<br />
<br />
A commendable part of this paper is that they have actually included their [http://github.com/dpkingma/nips14-ssl source code]. <br />
<br />
= Conclusions and Critique =<br />
<br />
The results using this method are impressive, and the fact that the model achieves them with computation times comparable to the other models is notable. The heavy use of approximate inference methods shows great promise for improving generative models and thus semi-supervised methods. The authors discuss the potential of combining this method with convolutional neural networks, the supervised method that has given state-of-the-art results in image processing. This should be possible since all parameters in their models are optimized using neural networks. The final model acts as an approximate Bayesian inference model. <br />
<br />
The architecture of the model is not made very explicit in the paper; a diagram showing the layout of the entire model would have aided understanding. Another weak point is that the authors do not compare their method to existing tractable-inference neural network methods: there is no comparison to Sum Product Networks or Deep Belief Networks.</div>
mULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION (2017-08-30)
<hr />
<div>= Introduction =<br />
<br />
Recognizing multiple objects in images has been one of the most important goals of computer vision. Previous work on classifying sequences of characters often employed a sliding-window detector with an individual character classifier. However, these systems can require setting components in a case-specific manner to determine possible object locations. In this paper an attention-based model for recognizing multiple objects in images is presented. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. The proposed method is shown to be more accurate than state-of-the-art convolutional networks while using fewer parameters and less computation.<br />
One of the main drawbacks of convolutional networks (ConvNets) is their poor scalability with increasing input image size so efficient implementations of these models have become necessary. In this work, the authors take inspiration from the way humans perform visual sequence recognition tasks such as reading by continually moving the fovea to the next relevant object or character, recognizing the individual object, and adding the recognized object to our internal representation of the sequence. The proposed system is a deep recurrent neural network that at each step processes a multi-resolution crop of the input image, called a “glimpse”. The network uses information from the glimpse to update its internal representation of the input, and outputs the next glimpse location and possibly the next object in the sequence. The process continues until the model decides that there are no more objects to process.<br />
<br />
= Deep Recurrent Visual Attention Model =<br />
<br />
For simplicity, the authors first describe how the model can be applied to classifying a single object and later show how it can be extended to multiple objects. Processing an image x with an attention-based model is a sequential process with N steps, where each step consists of a glimpse. At each step <math>n</math>, the model receives a location <math>l_n</math> along with a glimpse observation <math>x_n</math> taken at location <math>l_n</math>. The model uses the observation to update its internal state and outputs the location <math>l_{n+1}</math> to process at the next time-step. A graphical representation of the proposed model is shown in Figure 1.<br />
<br />
[[File:0.PNG | center]]<br />
<br />
The above model can be broken down into a number of sub-components, each mapping some input into a vector output. In this paper the term “network” is used to describe these sub-components.<br />
<br />
== Glimpse Network ==<br />
<br />
The job of the glimpse network is to extract a set of useful features from a glimpse of the raw visual input at a given location. The glimpse network is a non-linear function that receives the current input image patch, or glimpse (<math>x_n</math>), and its location tuple (<math>l_n</math>) as input, and outputs a feature vector that combines information about what was seen and where it was seen. <br />
There are two separate networks in the structure of the glimpse network, each with its own input. The first, which extracts features of the image patch, takes an image patch as input and consists of three convolutional hidden layers (without any pooling layers) followed by a fully connected layer. Separately, the location tuple is mapped using a fully connected hidden layer. Element-wise multiplication of the two output vectors then produces the final glimpse feature vector <math>g_n</math>.<br />
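This "what times where" combination can be sketched as follows. The linear maps are random placeholders standing in for the paper's convolutional stack and fully connected layers; the dimensions are arbitrary:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

W_what = rng.normal(0.0, 0.1, (64, 32))   # patch-feature map (placeholder)
W_where = rng.normal(0.0, 0.1, (2, 32))   # (row, col) location map (placeholder)

def glimpse_features(patch_feats, loc):
    what = np.tanh(patch_feats @ W_what)   # "what" pathway
    where = np.tanh(loc @ W_where)         # "where" pathway
    return what * where                    # element-wise product gives g_n

g = glimpse_features(rng.normal(size=64), np.array([0.1, -0.4]))
print(g.shape)
```

The element-wise product gates each feature by the location embedding, so the same patch seen at different locations yields different glimpse vectors.<br />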
<br />
== Recurrent Network ==<br />
<br />
The recurrent network aggregates information extracted from the individual glimpses and combines the information in a coherent manner that preserves spatial information. The glimpse feature vector <math>g_n</math> from the glimpse network is supplied as input to the recurrent network at each time step.<br />
The recurrent network consists of two recurrent layers. Two outputs of the recurrent layers are defined as <math>r_n^{(1)}</math> and <math>r_n^{(2)}</math>.<br />
<br />
== Emission Network ==<br />
<br />
The emission network takes the current state of recurrent network as input and makes a prediction on where to extract the next image patch for the glimpse network. It acts as a controller that directs attention based on the current internal states from the recurrent network. It consists of a fully connected hidden layer that maps the feature vector <math>r_n^{(2)}</math> from the top recurrent layer to a coordinate tuple <math>l_{n+1}</math>.<br />
<br />
== Context Network ==<br />
<br />
The context network provides the initial state for the recurrent network, and its output is used by the emission network to predict the location of the first glimpse. The context network <math>C(\cdot)</math> takes a down-sampled, low-resolution version of the whole input image <math>I_{coarse}</math> and outputs a fixed-length vector <math>c_I</math>. The contextual information provides sensible hints about where the potentially interesting regions are in a given image. The context network employs three convolutional layers that map the coarse image <math>I_{coarse}</math> to a feature vector.<br />
<br />
== Classification Network ==<br />
<br />
The classification network outputs a prediction for the class label y based on the final feature vector <math>r_N^{(1)}</math> of the lower recurrent layer. The classification network has one fully connected hidden layer and a softmax output layer for the class y.<br />
<br />
To prevent the model from learning to classify from contextual information alone rather than by combining information from different glimpses, the context network and classification network are connected to different recurrent layers in the deep model. This helps the deep recurrent attention model learn to look at locations that are relevant for classifying objects of interest.<br />
<br />
= Learning Where and What =<br />
<br />
Given the class labels <math>y</math> of an image <math>I</math>, learning can be formulated as a supervised classification problem with the cross-entropy objective function. The attention model predicts the class label conditioned on intermediate latent location variables <math>l</math> from each glimpse and extracts the corresponding patches. We can thus maximize the likelihood of the class label by marginalizing over the glimpse locations.<br />
<br />
[[File:2eq.PNG | center]]<br />
<br />
Using some simplifications, the practical algorithm to train the deep attention model can be expressed as:<br />
<br />
[[File:3.PNG | center]]<br />
<br />
where <math>\tilde{l}^m</math> is a sample of the location of glimpse <math>m</math>. This means that the glimpse location prediction can be sampled from the model after each glimpse. In the above equation, the log-likelihood (in the second term) has an unbounded range that can introduce substantial variance into the gradient estimator and sometimes induce an undesirably large gradient update that is backpropagated through the rest of the model. In this paper the term is therefore replaced with a 0/1 discrete indicator function (R), and a baseline technique (b) is used to reduce the variance of the estimator. <br />
<br />
[[File:4eq.PNG | center]]<br />
<br />
So the gradient update can be expressed as following:<br />
<br />
[[File:5.PNG | center]]<br />
<br />
In fact, by using the 0/1 indicator function, the learning rule from the above equation is equivalent to the REINFORCE learning rule, where R is the expected reward. During inference, the feed-forward location prediction can be used as a deterministic prediction of the location coordinates from which to extract the next input image patch. Alternatively, the marginalized objective function suggests a procedure to estimate the expected class prediction by using samples of location sequences <math>\{\tilde{l}_1^m,\dots,\tilde{l}_N^m\}</math> and averaging their predictions.<br />
<br />
[[File:6.PNG | center]]<br />
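The REINFORCE-style update with a 0/1 reward R and baseline b can be demonstrated on a toy problem. Everything below is illustrative, not the paper's emission network: the location "policy" is a one-dimensional Gaussian with a learnable mean, and reward 1 is given when the sampled location lands in an arbitrary "correct" region:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.0          # mean of the Gaussian location policy (learnable)
b = 0.5              # baseline, e.g. a running average of R (fixed here)
lr, sigma = 0.1, 1.0

for step in range(200):
    l = rng.normal(theta, sigma)            # sample a glimpse location
    R = 1.0 if abs(l - 2.0) < 1.0 else 0.0  # 0/1 reward: 1 iff in [1, 3]
    grad_logp = (l - theta) / sigma**2      # d log N(l; theta, sigma^2) / d theta
    theta += lr * (R - b) * grad_logp       # variance-reduced REINFORCE step

print(round(theta, 2))                      # drifts toward the rewarded region
```

Subtracting b leaves the gradient estimate unbiased (since the score function has zero mean) while shrinking its variance.<br />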
<br />
= Multi Object/Sequence Classification as a Visual Attention Task =<br />
<br />
The proposed attention model can be easily extended to solve classification tasks involving multiple objects. To train the recurrent network in this case, the multiple object labels for a given image need to be cast into an ordered sequence <math>\{y_1,\dots,y_S\}</math>. Assuming there are S targets in an image, the objective function for the sequential prediction is:<br />
<br />
[[File:7.PNG | center]]<br />
<br />
= Experiments =<br />
<br />
To show the effectiveness of the deep recurrent attention model (DRAM), multi-object classification tasks are investigated on two different datasets: MNIST and multi-digit SVHN.<br />
<br />
== MNIST Dataset Results ==<br />
<br />
Two main evaluations of the method are performed using the MNIST dataset:<br />
<br />
1) Learning to find digits<br />
<br />
2) Learning to do addition (the model has to find where each digit is and add the digits up; the task is to predict the sum of the two digits in the image as a classification problem)<br />
<br />
The results for both experiments are shown in Table 1 and Table 2. As shown in the tables, the DRAM model with a context network significantly outperforms the other models.<br />
<br />
[[File:8.PNG | center]]<br />
<br />
== SVHN Dataset Results ==<br />
<br />
The publicly available multi-digit street view house numbers (SVHN) dataset consists of images of digits taken from pictures of house fronts. This experiment is more challenging, and a model is trained to classify all the digits in an image sequentially. Two different models are implemented in this experiment.<br />
First, the label sequence ordering is chosen to go from left to right, the natural ordering of a house number. In this case, there is a performance gap between the state-of-the-art deep ConvNet and a single DRAM that “reads” from left to right. Therefore, a second recurrent attention model that “reads” the house numbers from right to left is trained as a backward model. The forward and backward models can share the same weights for their glimpse networks, but they have different weights for their recurrent and emission networks. The model performance is shown in Table 3:<br />
<br />
[[File:9.PNG | center]]<br />
<br />
As shown in the table, the proposed deep recurrent attention model (DRAM) outperforms the state-of-the-art deep ConvNets on the standard SVHN sequence recognition task.<br />
<br />
= Discussion and Conclusion =<br />
<br />
The recurrent attention models only process a selected subset of the input and thus have a lower computational cost than a ConvNet that looks over an entire image. Also, they can naturally work on images of different sizes with the same computational cost, independent of the input dimensionality. Moreover, the attention-based model is less prone to over-fitting than ConvNets, likely because of the stochasticity in the glimpse policy during training. Duvedi C et al. <ref><br />
Duvedi C and Shah P. [http://vision.stanford.edu/teaching/cs231n/reports/cduvedi_report.pdf Multi-Glance Attention Models For Image Classification ], <br />
</ref> developed a two-glance approach that uses a combination of multiple convolutional neural networks and recurrent neural networks. In this approach the RNN generates a location for a glimpse within the image, and a CNN extracts features from a fixed-size glimpse at the selected location. The RNN then generates the location of the next glimpse, and the process continues, combining the features of all relevant patches to cover the whole picture.<br />
<br />
<br />
= References =<br />
<references /></div>
on the difficulty of training recurrent neural networks (2017-08-30)
<hr />
<div>= Introduction =<br />
<br />
Training Recurrent Neural Networks (RNNs) is difficult, and two of the most prominent problems have been vanishing and exploding gradients, <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages<br />
1183–1188. IEEE, 1993.</ref> which prevent neural networks from learning and fitting data with long-term dependencies. In this paper the authors propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network Unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.</ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>\mathbf{x}_{t} = F(\mathbf{x}_{t-1}, \mathbf{u}_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{x}_{t}</math> is the state at time <math>t</math></span><br />
* <span><math>\mathbf{u}_{t}</math> is the input at time <math>t</math></span><br />
* <span><math>\theta\,</math> are the parameters</span><br />
* <span><math>F()\,</math> is the function that represents a neuron</span><br />
<br />
In the theoretical sections the authors make use of a specific parameterization:<br />
<br />
<math>\mathbf{x}_{t} = \mathbf{W}_{rec} \sigma(\mathbf{x}_{t - 1}) + \mathbf{W}_{in} \mathbf{u}_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{W}_{rec}</math> is the RNN weight matrix</span><br />
* <span><math>\sigma()\,</math> is an element wise function</span><br />
* <span><math>b\,</math> is the bias</span><br />
* <span><math>\mathbf{W}_{in}</math> is the input weights matrix</span><br />
<br />
The following are gradient equations for using the Back Propagation Through Time (BPTT) algorithm. The authors rewrote the equations in order to highlight the exploding gradients problem:<br />
<br />
<math>\frac{\partial \varepsilon}{\partial \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\partial \varepsilon_t}{\partial \theta}</math><br />
<br />
<math>\frac{\partial \varepsilon_{t}}{\partial \theta} = 
\sum_{1 \leq k \leq t} 
\left(
\frac{\partial \varepsilon_{t}}{\partial x_{t}}
\frac{\partial x_{t}}{\partial x_{k}}
\frac{\partial^{+} x_{k}}{\partial \theta}
\right)</math><br />
<br />
<math>\frac{\partial x_{t}}{\partial x_{k}} =
\prod_{t \geq i > k} \frac{\partial x_{i}}{\partial x_{i - 1}} =
\prod_{t \geq i > k} 
\mathbf{W}^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math> is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\partial^{+} \mathbf{x}_{k}}{\partial \theta}</math> is the immediate partial derivative of state <math>\mathbf{x}_{k}</math></span>. For the parameterization above, <math>\frac{\partial^+ \mathbf{x}_k}{\partial \mathbf{W}_{rec}} = \sigma(\mathbf{x}_{k-1})</math>.<br />
<br />
The authors of this paper also distinguish between ''long-term'' and ''short-term'' contributions to the gradient with respect to <math>\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}</math>. The contribution is ''long-term'' if <math>k \ll t</math>, and ''short-term'' otherwise.<br />
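The factor <math>\frac{\partial x_t}{\partial x_k}</math> can be probed numerically. In the sketch below, <math>\sigma</math> is taken to be the identity (so <math>\sigma' = 1</math>), making each factor just <math>\mathbf{W}_{rec}^T</math>; using a scaled orthogonal matrix makes the largest singular value <math>\lambda_1</math> explicit:<br />

```python
import numpy as np

# Product of per-step Jacobians over (t - k) steps; with identity sigma its
# norm behaves like lambda_1^(t-k), the largest singular value to the power
# of the time span.
def product_norm(W, n_steps):
    J = np.eye(W.shape[0])
    for _ in range(n_steps):
        J = W.T @ J          # one chain-rule factor per time step
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
Q = np.linalg.qr(rng.normal(size=(5, 5)))[0]   # random orthogonal matrix
for lam in (0.9, 1.1):
    W = lam * Q              # largest singular value is exactly lam
    print(lam, product_norm(W, 50))   # vanishes for 0.9, explodes for 1.1
```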
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It is known that <math>|\sigma'(x)|</math> is bounded. Let <math>\left|\left|diag(\sigma'(x_k))\right|\right| \leq \gamma \in R</math>.<br />
<br />
The paper first proves that <math>\lambda_1 < \frac{1}{\gamma}</math>, where <math>\lambda_1</math> is the largest singular value of <math>\mathbf{W}_{rec}</math>, is a sufficient condition for the vanishing gradient problem to occur. The Jacobian matrix <math>\frac{\partial x_{k+1}}{\partial x_k}</math> is given by <math>\mathbf{W}_{rec}^{T}diag(\sigma'(x_k))</math>. The 2-norm of this Jacobian is bounded by the product of the norms of the two matrices, which leads to <math>\forall k, ||\frac{\partial{x_{k+1}}}{\partial x_k}|| \leq ||\mathbf{W}_{rec}^T||\,||diag(\sigma'(x_k))|| < \frac{1}{\gamma}\gamma = 1</math><br />
<br />
Let <math>\eta \in R</math> be such that <math>\forall k, ||\frac{\partial {x_{k+1}}}{\partial x_k}|| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>||\frac{\partial \varepsilon_t}{\partial x_t}(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}})|| \leq \eta^{t-k}||\frac{\partial \varepsilon_t}{\partial x_t}||</math>. Since <math> \eta < 1 </math>, as <math> t-k </math> goes larger, the gradient goes to 0.<br />
<br />
By inverting this argument, it follows that <math>\lambda_1 > \frac{1}{\gamma}</math> is a necessary condition for the gradients to explode.<br />
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing from a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>; this is because crossing these bifurcations has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing these bifurcation points does not guarantee a sudden change in gradients. Their idea is that a change to the model parameters can alter the attractor landscape in such a way that the basin of attraction corresponding to the current model state is unaltered. For example, a change to the model parameters might eliminate a basin of attraction in a portion of the model's state space that is very far from its current state. In this case, the bifurcation will have no effect on the asymptotic behaviour of the model, and there will accordingly be no gradient explosion. On the other hand, if a change to the model parameters substantially alters the final basin of attraction given the current state, then there will be a considerable effect on the asymptotic behaviour of the model, and the gradients will explode accordingly. <br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts a bifurcation diagram for a single-unit RNN, where the x-axis is the parameter <math>b</math> (bias) and the y-axis is the asymptotic state <math>x_{\infty}</math> (i.e. the equilibrium activation value of the unit), and the plot line is the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. What this figure represents is the presence of two attractors, one emerging from <math>b_1</math> and another disappearing at <math>b_2</math>, as the value of <math>b</math> is decreased. Note that only one attractor exists when the value of <math>b</math> is outside of the interval between <math>b_1</math> and <math>b_2</math>, and that when two attractors exist, the attractor state towards which the unit ultimately gravitates is determined by its initial starting state. The boundary between these two basins of attraction is denoted with the dashed line; starting states on opposite sides of this boundary will gravitate towards different attractor states. The blue filled circles indicate a bifurcation point at which a small change to the value of <math>b</math> can have a drastic effect on the attractor landscape over the unit's state space. In short, the landscape shifts to include a single attractor state for a low value of <math>x</math>. The unfilled green circles represent Pascanu’s (2013) extension of Doya’s hypothesis, where if the model is in the boundary range at time <math>0</math>, a small change in <math>b</math> would result in a sudden large change in <math>x_{t}</math>.<br />
<br />
Overall, these remarks indicate that, when treated as dynamical system, the behaviour of a RNN can be analyzed with respect to both changes to the parameter values that determine an attractor landscape over its state space (assuming a fixed starting state), and with respect to changes to the starting state (assuming a fixed attractor landscape).<br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also consider a geometric perspective, using a simple one-hidden-unit RNN.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
Assuming no input, a linear activation, <math>b = 0</math>, and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating this equation to first and second order with respect to the recurrent weight (denoted <math>\omega</math>) gives:<br />
<br />
<math>\frac{\delta x_{t}}{\delta \omega} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\delta^{2} x_{t}}{\delta \omega^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first-order derivative explodes, then so does the second. In the general case, when gradients explode they do so along some direction '''v'''. If this bound is tight, it is hypothesized that ''when gradients explode so does the curvature along'' '''v''', ''leading to a wall in the error surface'', like the one seen above. If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction '''v''', it follows that the error surface has a steep wall perpendicular to '''v''' (and consequently to the gradient). This means that when stochastic gradient descent (SGD) reaches the wall and attempts to step into it, it is deflected away, which can hinder the learning process (see figure above). Note that this explanation assumes that the valley bordered by the steep cliff in the value of the loss function is wide enough with respect to the clipping applied to the gradient; otherwise the deflection caused by an SGD update would still hinder learning despite clipping being used. The practical effectiveness of clipping provides some evidence in support of this assumption.<br />
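To make the derivative growth concrete, here is a small illustrative Python sketch (toy values of our own, not from the paper) of the closed-form derivatives above, showing that for <math>|w| > 1</math> the second derivative explodes along with the first, while for <math>|w| < 1</math> both vanish:<br />

```python
def dx_dw(w, x0, t):
    """First derivative of x_t = w**t * x0 with respect to w: t * w**(t-1) * x0."""
    return t * w ** (t - 1) * x0

def d2x_dw2(w, x0, t):
    """Second derivative of x_t with respect to w: t * (t-1) * w**(t-2) * x0."""
    return t * (t - 1) * w ** (t - 2) * x0

# For |w| > 1 both quantities blow up as t grows: the curvature explodes
# together with the gradient, producing the "wall" in the error surface.
for t in (10, 50):
    print(t, dx_dw(1.2, 1.0, t), d2x_dw2(1.2, 1.0, t))
```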
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin and prevents the model from learning generator models or exhibiting long-term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, this pushes the model across a bifurcation boundary whenever it does not exhibit the desired asymptotic behaviour. This assumes the user knows what the target behaviour looks like, or knows how to initialize the model to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009</ref><ref name="Hochreiter">Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back to itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011</ref>, this addresses both the vanishing and exploding gradient problems. <ref name="pascanu"></ref> reasons that this approach helps with the vanishing gradient problem because the high dimensionality of the space gives rise to a high probability that the long-term components are orthogonal to the short-term components. Additionally, for the exploding gradient, the curvature is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problem by not learning the input and recurrent weights; these are instead drawn from hand-crafted distributions that prevent information from being lost, since the spectral radius of the recurrent weight matrix is usually kept smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind this gradient clipping algorithm is simple: compute the norm of the gradients and, if it is larger than the set threshold, scale the gradients by a constant equal to the threshold divided by the norm of the gradients. <ref name="pascanu"></ref> suggests using a threshold value from half to ten times the average gradient norm.<br />
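A minimal Python sketch of this clipping rule (the variable names are our own) follows; it rescales the gradient whenever its L2 norm exceeds the threshold, preserving the direction:<br />

```python
import numpy as np

def clip_gradient(grad, threshold):
    """If ||grad|| exceeds the threshold, rescale by threshold / ||grad||;
    the direction is kept, only the magnitude shrinks."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])            # norm 5
print(clip_gradient(g, 1.0))        # -> [0.6 0.8], rescaled to unit norm
print(clip_gradient(g, 10.0))       # -> [3. 4.], left untouched
```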
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\| <br />
\frac{\delta \varepsilon}{\delta x_{k + 1}} <br />
\frac{\delta x_{k + 1}}{\delta x_{k}}<br />
\|<br />
}<br />
{<br />
\|<br />
\frac{\delta \varepsilon}{\delta x_{k + 1}}<br />
\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
The vanishing gradient problem occurs when, at time <math>t</math>, the inputs <math>u</math> may be irrelevant and noisy, and the network starts to learn to ignore them. However, this is not desirable, as the model will end up not learning anything. The authors note that the sensitivity to all inputs <math>u_{t} \dots u_{k}</math> can be increased by increasing the norm of <math>\frac{\delta x_t}{\delta x_{k}}</math>. Increasing this norm directly would require the error to remain large, which would prevent the model from converging; thus the authors argue that a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\delta x_{k + 1}}{\delta x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\delta \varepsilon}{\delta x_{k + 1}}</math>.<br />
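The per-step term <math>\Omega_k</math> can be sketched directly from the formula above. In this illustrative Python snippet (toy vectors and our own naming), a norm-preserving Jacobian incurs no penalty while a contractive one is penalized:<br />

```python
import numpy as np

def omega_k(err_grad, jacobian):
    """One regularizer term: (||d^T J|| / ||d|| - 1)^2, where d is the error
    gradient w.r.t. x_{k+1} and J is the Jacobian dx_{k+1}/dx_k."""
    ratio = np.linalg.norm(err_grad @ jacobian) / np.linalg.norm(err_grad)
    return (ratio - 1.0) ** 2

d = np.array([1.0, 0.0])
print(omega_k(d, np.eye(2)))        # norm-preserving Jacobian -> 0.0 (no penalty)
print(omega_k(d, 0.5 * np.eye(2)))  # contractive Jacobian     -> 0.25 (penalized)
```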
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors used the temporal order problem as the prototypical pathological problem for validating the clipping and the regularizer they devised. The temporal order problem involves generating a long sequence of discrete distractor symbols, with an <math>A</math> or a <math>B</math> symbol placed near the beginning and another near the middle of the sequence. The task is to correctly classify the ordered pair of these two symbols at the end of the sequence.<br />
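A toy Python generator for such sequences might look as follows; the exact symbol positions and the distractor alphabet are our own assumptions, chosen only to illustrate the task:<br />

```python
import random

def temporal_order_example(length, distractors="cdef"):
    """One training example: distractor symbols everywhere except two
    positions (assumed here: length//10 and length//2) that hold the
    relevant 'A'/'B' symbols; the label is their ordered pair."""
    seq = [random.choice(distractors) for _ in range(length)]
    first, second = random.choice("AB"), random.choice("AB")
    seq[length // 10] = first
    seq[length // 2] = second
    return "".join(seq), first + second

seq, label = temporal_order_example(100)
print(seq, label)    # label is one of AA, AB, BA, BB
```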
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
For each of the three networks, three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
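The spectral-radius setting listed above can be imposed on a random recurrent weight matrix by rescaling, as in this illustrative sketch (the Gaussian initialization scale is an assumption of ours):<br />

```python
import numpy as np

def init_recurrent_weights(n_hidden, spectral_radius=0.95, seed=0):
    """Draw W_rec from a Gaussian, then rescale it so that its largest
    absolute eigenvalue equals the requested spectral radius."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.1, size=(n_hidden, n_hidden))
    return W * (spectral_radius / max(abs(np.linalg.eigvals(W))))

W_rec = init_recurrent_weights(50)
print(max(abs(np.linalg.eigvals(W_rec))))   # ~0.95
```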
<br />
The experiment was performed 5 times, and from the figure below we can observe the importance of gradient clipping and the regularizer. In all cases, the combination of the two methods yielded the best results regardless of which unit network was used. Furthermore, this experiment provides empirical evidence that exploding gradients correlate with tasks that require long memory traces: as the sequence length of the problem increases, clipping and regularization become more important. This is largely because increased memory requirements favour a larger spectral radius, which in turn increases the likelihood of gradient explosion.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explains the exploding and vanishing gradient problems in training RNNs from two different perspectives: the dynamical systems approach and the geometric approach. The authors devised methods to mitigate the corresponding problems by introducing gradient clipping and a vanishing gradient regularizer, and their experimental results showed that in all cases except the Penn Treebank dataset, clipping and the regularizer improved the results of the RNNs in their respective experiments.</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semi-supervised_Learning_with_Deep_Generative_Models&diff=27757Semi-supervised Learning with Deep Generative Models2017-08-30T13:46:35Z<p>Conversion script: Conversion script moved page Semi-supervised Learning with Deep Generative Models to semi-supervised Learning with Deep Generative Models: Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[semi-supervised Learning with Deep Generative Models]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=MULTIPLE_OBJECT_RECOGNITION_WITH_VISUAL_ATTENTION&diff=27759MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION2017-08-30T13:46:35Z<p>Conversion script: Conversion script moved page MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION to mULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION: Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[mULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=On_the_difficulty_of_training_recurrent_neural_networks&diff=27761On the difficulty of training recurrent neural networks2017-08-30T13:46:35Z<p>Conversion script: Conversion script moved page On the difficulty of training recurrent neural networks to on the difficulty of training recurrent neural networks: Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[on the difficulty of training recurrent neural networks]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Strategies_for_Training_Large_Scale_Neural_Network_Language_Models&diff=27751Strategies for Training Large Scale Neural Network Language Models2017-08-30T13:46:34Z<p>Conversion script: Conversion script moved page Strategies for Training Large Scale Neural Network Language Models to strategies for Training Large Scale Neural Network Language Models: Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[strategies for Training Large Scale Neural Network Language Models]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_Fast_Approximations_of_Sparse_Coding&diff=27753Learning Fast Approximations of Sparse Coding2017-08-30T13:46:34Z<p>Conversion script: Conversion script moved page Learning Fast Approximations of Sparse Coding to learning Fast Approximations of Sparse Coding: Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[learning Fast Approximations of Sparse Coding]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Neural_Nets_as_a_Method_for_Quantitative_Structure%E2%80%93Activity_Relationships&diff=27755Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships2017-08-30T13:46:34Z<p>Conversion script: Conversion script moved page Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships to deep Neural Nets as a Method for Quantitative Structure–Activity Relationships: Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[deep Neural Nets as a Method for Quantitative Structure–Activity Relationships]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=27748on using very large target vocabulary for neural machine translation2017-08-30T13:46:34Z<p>Conversion script: Conversion script moved page On using very large target vocabulary for neural machine translation to on using very large target vocabulary for neural machine translation: Converting page titles to lowercase</p>
<hr />
<div>==Overview==<br />
<br />
This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation"<br />
<ref>S. Jean, K. Cho, R Memisevic, and Y. Bengio. [http://arxiv.org/pdf/1412.2007v2.pdf "On Using Very Large Target Vocabulary for Neural Machine Translation"], 2015.</ref><br />
The paper presents the application of importance sampling to neural machine translation with a very large target vocabulary. Despite the advantages of neural networks in machine translation over statistical machine translation systems such as the phrase-based system, they suffer from some technical problems. Most importantly, they are limited to a small vocabulary because of the complexity and the number of parameters that must be trained as the total vocabulary increases. The output of an RNN used for machine translation has as many dimensions as there are words in the vocabulary. If the total vocabulary consists of hundreds of thousands of words, then the RNN must compute a very expensive softmax on the output vector at each time step to estimate the probability of each word as the next word in the sequence. The number of parameters in the RNN also grows very large in such cases, given that the number of weights between the hidden layer and output layer equals the product of the number of units in each layer. For a non-trivially sized hidden layer, a large vocabulary can result in tens of millions of model parameters purely associated with the hidden-to-output mapping. In practice, researchers who apply RNNs to machine translation have avoided this problem by restricting the model vocabulary to a shortlist of words in the target language. Words not in this shortlist are treated as unknown by the model and assigned a special 'UNK' token. This technique understandably impairs translation performance when the target sentence includes a large number of words not present in the vocabulary, such as names. <br />
<br />
In this paper Jean and his colleagues aim to solve this problem by proposing a training method based on importance sampling which uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrates better performance without losing efficiency in time or speed. The algorithm is tested on two machine translation tasks (English <math>\rightarrow</math> German, and English <math>\rightarrow</math> French), and it achieved the best performance out of any previous single neural machine translation (NMT) system on the English <math>\rightarrow</math> French translation task.<br />
<br />
==Methods==<br />
<br />
Recall that the classic neural machine translation system works through an encoder-decoder network. The encoder reads the source sentence x and encodes it into a sequence of hidden states h, where <math>h_t=f(x_t,h_{t-1})</math>. In the decoding step, another neural network generates the translation y based on the encoded sequence of hidden states h: <math>p(y_t\,|\,y_{<t},x)\propto \exp\{q(y_{t-1}, z_t, c_t)\}</math> where <math>\, z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math>\, c_t=r(z_{t-1}, h_1,..., h_T)</math><br />
<br />
The objective function to be maximized is <br />
<math>\theta^*=\arg\max_{\theta}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\log p(y_t^n\,|\,y_{<t}^n, x^n)</math><br />
<br />
where <math>(x^n, y^n)</math> is the n-th training pair of sentence, and <math>T_n</math> is the length of n-th target sentence <math>y^n</math>.<br />
The proposed model is based on specific implementation of neural machine translation that uses an attention mechanism, as recently proposed in <ref><br />
Bahdanau et al.,[http://arxiv.org/pdf/1409.0473v6.pdf NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE], 2014<br />
</ref>.<br />
Here the encoder is implemented by a bi-directional recurrent neural network, <math>h_t=[h_t^\leftarrow; h_t^\rightarrow]</math>. The decoder, at each time step, computes the context<br />
vector <math>c_t</math> as a convex sum of the hidden states <math>(h_1,...,h_T)</math> with the coefficients <math>(\alpha_1,...,\alpha_T)</math> computed by<br />
<br />
<math>\alpha_t=\frac{\exp\{a(h_t, z_t)\}}{\sum_{k}\exp\{a(h_k, z_t)\}}</math><br />
where a is a feedforward neural network with a single hidden layer. <br />
Then the probability of the next target word is <br />
<br />
<math>p(y_t\,|\,y_{<t}, x)=\frac{1}{Z} \exp\{w_t^T\phi(y_{t-1}, z_t, c_t)+b_t\}</math>. Here <math>\phi</math> is an affine transformation followed by a nonlinear activation, and <math>w_t</math> and <math>b_t</math> are the target word vector and the target word bias, respectively. Z is the normalization constant computed by<br />
<br />
<br />
<math> Z=\sum_{k:y_k\in V}\exp\left(w_k^T\phi(y_{t-1}, z_t, c_t)+b_k\right)</math> where V is the set of all target words. <br />
<br />
<br />
The dot product between the feature <math>\phi(y_{t-1}, z_t, c_t)</math> and <math>w_t</math> must be computed for all words in the target vocabulary, which is computationally complex and time consuming. Furthermore, the memory requirement grows linearly with the number of target words. This has been a major hurdle for neural machine translation. Recent approaches use a shortlist of the 30,000 to 80,000 most frequent words. This makes training more feasible but has problems of its own: for example, the model degrades heavily if the translation of the source sentence requires many words that are not included in the shortlist. The approach of this paper uses only a subset of sampled target words at each update, instead of all likely target words. The most naïve way to select such a subset is to take the K most frequent words; however, this effectively removes a set of words from the target dictionary and so conflicts with the goal of using a large vocabulary. Instead, Jean et al. propose using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, a target word set is constructed consisting of the K most frequent words (according to the estimated unigram probability) and, using the dictionary, at most <math>K\prime</math> likely target words for each source word. K and <math>K\prime</math> may be chosen either to meet the computational requirement or to maximize the translation performance on the development set. <br />
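The candidate-list construction can be sketched as follows; the dictionary and shortlist here are toy stand-ins for the alignment-derived dictionary and unigram-frequency shortlist described above:<br />

```python
def candidate_vocab(source_words, frequent_words, dictionary, k_prime):
    """Per-sentence target word set: the K most frequent target words plus
    at most k_prime likely translations of each source word. `dictionary`
    maps a source word to translations sorted by alignment probability."""
    vocab = set(frequent_words)
    for w in source_words:
        vocab.update(dictionary.get(w, [])[:k_prime])
    return vocab

freq = ["the", "a", "of"]                       # toy K = 3 shortlist
dico = {"chat": ["cat", "kitty"],
        "noir": ["black", "dark", "sinister"]}  # toy alignment dictionary
print(candidate_vocab(["chat", "noir"], freq, dico, k_prime=2))
```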
In order to avoid the growing complexity of computing the normalization constant, the authors proposed to use only a small subset <math>v\prime</math> of the target vocabulary at each update<ref><br />
Bengio and Sen´ et al, [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4443871.pdf Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model ],IEEEXplor, 2008<br />
</ref>. <br />
Let us consider the gradient of the log conditional probability of <math>y_t</math>. The gradient is composed of a positive part and a negative part:<br />
<br />
<br />
<math>\nabla \log p(y_t\,|\,y_{<t}, x)=\nabla \mathbf\varepsilon(y_t)-\sum_{k:y_k\in V} p(y_k\,|\,y_{<t}, x)\, \nabla \mathbf\varepsilon(y_k) </math><br />
where the energy <math>\mathbf\varepsilon</math> is defined as <math>\mathbf\varepsilon(y_j)=w_j^T\phi(y_{j-1}, z_j, c_j)+b_j</math>. The second term of the gradient is in essence the expected gradient of the energy, <math>\mathbb E_P[\nabla \mathbf\varepsilon(y)]</math>, where P denotes <math>p(y\,|\,y_{<t}, x)</math>. <br />
The idea of the proposed approach is to approximate this expectation of the gradient by importance sampling with a small number of samples. Given a predefined proposal distribution Q and a set <math>v\prime</math> of samples from Q, we approximate the expectation with <br />
<br />
<math>\mathbb E_P[\nabla \mathbf\varepsilon(y)] \approx \sum_{k:y_k\in V\prime} \frac{w_k}{\sum_{k\prime:y_{k\prime}\in V\prime}w_{k\prime}}\, \nabla \mathbf\varepsilon(y_k)</math> where <math>\,w_k=\exp\{\mathbf\varepsilon(y_k)-\log Q(y_k)\}</math><br />
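The approximation can be illustrated numerically. In this toy Python sketch (random energies and scalar per-word gradients of our own, with a uniform proposal Q), the importance-sampled estimate over a small subset approaches the exact O(V) expectation:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1000                              # toy vocabulary size
energy = rng.normal(size=V)           # epsilon(y_k) for every target word
grads = rng.normal(size=V)            # toy per-word gradient of the energy

# Exact expectation under the softmax P -- requires O(V) work.
p = np.exp(energy - energy.max())
p /= p.sum()
exact = p @ grads

# Importance-sampled estimate over a small subset drawn from a uniform Q.
idx = rng.choice(V, size=200, replace=False)
w = np.exp(energy[idx] - np.log(1.0 / V))   # w_k = exp(eps(y_k) - log Q(y_k))
estimate = (w / w.sum()) @ grads[idx]
print(exact, estimate)                      # close, at a fifth of the cost
```

Note that if the sample covers the whole vocabulary, the normalized importance weights reduce exactly to the softmax probabilities.<br />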
<br />
In practice, the training corpus is partitioned, and a subset <math>V\prime</math> of the target vocabulary is defined for each partition prior to training. Before training begins, each target sentence in the training corpus is examined sequentially, accumulating unique target<br />
words until their number reaches the predefined threshold <math>\tau</math>. The accumulated vocabulary is then used for this partition of the corpus during training. This process is repeated until the end of the training set is reached. <br />
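A simplified sketch of this partitioning procedure (whitespace tokenization and the tiny corpus are our own assumptions):<br />

```python
def partition_corpus(target_sentences, tau):
    """Greedily split the corpus into partitions whose sample vocabulary V'
    holds at most tau unique target words (simplified sketch)."""
    partitions, current, vocab = [], [], set()
    for sent in target_sentences:
        words = set(sent.split())
        if current and len(vocab | words) > tau:
            partitions.append((current, vocab))   # vocabulary full: close partition
            current, vocab = [], set()
        current.append(sent)
        vocab |= words
    if current:
        partitions.append((current, vocab))
    return partitions

parts = partition_corpus(["a b c", "c d", "e f g h"], tau=4)
print([sorted(v) for _, v in parts])   # two partitions: {a,b,c,d} and {e,f,g,h}
```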
<br />
In this approach, the alignments between the target words and source locations are obtained via the alignment model. This is useful when the model generates an UNK token: once a translation is generated for a source sentence, each UNK may be replaced using a translation-specific technique based on the aligned source word. In the experiments, the authors replaced each ''UNK'' token with the aligned source word or its most likely translation determined by another word alignment model.<br />
The proposed approach was evaluated on English→French and English→German translation. The neural machine translation models were trained on the bilingual, parallel corpora made available as part of WMT’14. The data sets used for English→French were Europarl v7, Common Crawl, UN, News Commentary, and Gigaword; for English→German they were Europarl v7, Common Crawl, and News Commentary. <br />
<br />
The models were evaluated on the WMT’14 test set (news-test-2014), while the concatenation of news-test-2012 and news-test-2013 was used for model selection (development set). Table 1 of the paper presents data coverage w.r.t. the vocabulary size on the target side.<br />
<br />
==Setting==<br />
<br />
As a baseline for English→French translation, the authors used the RNNsearch model proposed by Bahdanau et al. (2014) with 30,000 source and target words, and another RNNsearch was trained for English→German translation with 50,000 source and target words. Using the proposed approach, another set of RNNsearch models with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes were used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. The best performance on the development set was evaluated and reported every twelve hours. For both language pairs, new models were trained with shortlist sizes of 15,000 and 50,000 by reshuffling the data set at the beginning of each epoch. While this causes a non-negligible amount of overhead, it allows words to be contrasted with different sets of words in each epoch. Beam search was used to generate a translation given a source sentence: the authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences. The configuration was chosen to maximize performance on the development set, for K ∈ {15k, 30k, 50k} and K′ ∈ {10, 20}. They used a bilingual dictionary to accelerate decoding and to replace unknown words in translations.<br />
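The length normalization used when ranking beam hypotheses can be illustrated with a toy sketch; the hypothesis structure is our own, and the example shows a longer candidate winning under the normalized score even though its raw probability is lower:<br />

```python
import math

def rerank(hypotheses):
    """Order beam hypotheses by length-normalized log-probability."""
    def normalized_score(h):
        return sum(math.log(p) for p in h["word_probs"]) / len(h["word_probs"])
    return sorted(hypotheses, key=normalized_score, reverse=True)

hyps = [{"text": "yes", "word_probs": [0.6]},
        {"text": "yes indeed sir", "word_probs": [0.8, 0.8, 0.8]}]
# Raw probability favours the short hypothesis (0.6 > 0.8**3 = 0.512),
# but the per-word normalized score favours the longer one.
print(rerank(hyps)[0]["text"])   # -> "yes indeed sir"
```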
<br />
==Results==<br />
<br />
The results for English-> French translation obtained by the trained models with very large target vocabularies compared with results of previous models reported in Table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Google<br />
! Phrase-based SMT (Cho et al.)<br />
! Phrase-based SMT (Durrani et al.)<br />
|-<br />
| BASIC NMT<br />
| 29.97 (26.58)<br />
| 32.68 (28.76)<br />
| 30.6<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Candidate List <br />
+ UNK Replace<br />
| 33.08 (29.08)<br />
| 33.36 (29.32)<br />
34.11 (29.98)<br />
| -<br />
33.1<br />
| 33.3<br />
| 37.03<br />
|- <br />
| + Reshuffle (tau=50)<br />
| -<br />
| 34.6 (30.53)<br />
| -<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Ensemble<br />
| -<br />
| 37.19 (31.98)<br />
| 37.5 <br />
| 33.3<br />
| 37.03<br />
|-<br />
|}<br />
<br />
<br />
And the results for English->German translation in Table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Phrase-based SMT <br />
|-<br />
| BASIC NMT<br />
| 16.46 (17.13)<br />
| 16.95 (17.85)<br />
| 20.67<br />
|-<br />
| + Candidate List <br />
+ UNK Replace<br />
| 18.97 (19.16)<br />
| 17.46 (18.00)<br />
18.89 (19.03)<br />
| 20.67<br />
|- <br />
| + Reshuffle (tau=50)<br />
| -<br />
| 19.4<br />
| 20.67<br />
|-<br />
| + Ensemble<br />
| -<br />
| 21.59<br />
| 20.67 <br />
|-<br />
|}<br />
<br />
It is clear that the RNNsearch-LV outperforms the baseline RNNsearch. In the case of the English→French task, RNNsearch-LV approached the performance level of the previous best single neural machine translation (NMT) model, even without any translation specific techniques. With these, however, the RNNsearch-LV outperformed it. The performance of the RNNsearch-LV is also better than that of a standard phrase-based translation system. <br />
For English→German, the RNNsearch-LV outperformed the baseline before unknown word replacement, but after doing so, the two systems performed similarly. A higher large-vocabulary single-model performance was achieved by reshuffling the data set. In this case, the previously reported best translation result on this task was surpassed by building an ensemble of 8 models. With τ = 15,000, the RNNsearch-LV performance worsened slightly, with best BLEU scores, without reshuffling, of 33.76 and 18.59 for English→French and English→German respectively.<br />
<br />
Decoding timings for the different models are presented in the table below. While decoding from RNNsearch-LV with the full target vocabulary is slowest, the speed improves substantially if a candidate list is used for decoding each translation. <br />
{| class="wikitable"<br />
|-<br />
! Method <br />
! CPU i7-4820k<br />
! GPU GTX TITAN black<br />
|-<br />
| RNNsearch<br />
| 0.09 s<br />
| 0.02 s<br />
|-<br />
| RNNsearch-LV <br />
| 0.80 s<br />
| 0.25 s<br />
|-<br />
| RNNsearch-LV<br />
+Candidate list<br />
| 0.12 s<br />
| 0.05 s<br />
|}<br />
<br />
The influence of the target vocabulary was evaluated for English→French by translating the test sentences using the union of a fixed set of 30,000 common words and (at most) K′ likely candidates for each source word. The performance of the system is comparable to the baseline when UNK tokens are not replaced, but there is not as much improvement when they are.<br />
The authors found that K is inversely correlated with t. <br />
<br />
<br />
==Conclusion==<br />
<br />
Using importance sampling, an approach was proposed for neural machine translation with a large target vocabulary without any substantial increase in computational complexity. The BLEU scores of the proposed model showed translation performance comparable to the state-of-the-art translation systems on both the English→French task and the English→German task.<br />
On English→French and English→German translation tasks, the neural machine translation models trained using the proposed method performed as well as, or better than, those using only limited sets of target words, even when replacing unknown words.<br />
<br />
<br />
== Bibliography ==<br />
<references /></div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=strategies_for_Training_Large_Scale_Neural_Network_Language_Models&diff=27750strategies for Training Large Scale Neural Network Language Models2017-08-30T13:46:34Z<p>Conversion script: Conversion script moved page Strategies for Training Large Scale Neural Network Language Models to strategies for Training Large Scale Neural Network Language Models: Converting page titles to lowercase</p>
<hr />
<div>
== Introduction ==<br />
Statistical models of natural languages are a key part of many systems today. The most widely known applications are automatic speech recognition, machine translation, and optical character recognition. In recent years, language models, including recurrent neural network and maximum entropy based models, have gained a lot of attention and are considered the most successful models. However, the main drawback of these models is their huge computational complexity. <br />
This paper introduces a hash-based implementation of a class-based maximum entropy model that makes it easy to control the trade-off between memory complexity and computational<br />
complexity.<br />
<br />
== Motivation ==<br />
As computational complexity is an issue for different types of deep neural network language models, this study briefly presents simple techniques that can be used to reduce the computational cost of the training and test phases. The study also notes that training neural network language models together with maximum entropy models leads to better performance in terms of computational complexity. <br />
The maximum entropy model can be viewed as a neural network model with no hidden layer, with the input layer directly connected to the output<br />
layer.<br />
<br />
<br />
== Model description==<br />
The main difference between a neural network language model and a maximum entropy model is that the features for the NN LM are automatically learned as a function of the history. Also, the usual features for the ME model are binary, while<br />
NN models use continuous-valued features. After the model is trained, similar words have similar<br />
low-dimensional representations.<br />
<br />
== Recurrent Neural Network Models==<br />
The standard neural network language model has a very similar form to the maximum entropy model. The main difference is that the features for this model are automatically learned as a function of the history. Also, the usual features for the ME model are binary, while NN models use continuous-valued features. The NN LM can be described as:<br />
<br />
<math>P(w|h)=\frac{e^{\sum_{i=1}^N \lambda_i f_i(s,w)}} {\sum_{w'} e^{\sum_{i=1}^N\lambda_i f_i(s,w')}}</math><br />
<br />
where f is a set of feature, λ is a set of weights, and s is a state of the hidden layer. For the feedforward NN LM architecture, the state of the hidden layer depends on a projection layer, that is formed as a projection of N − 1 recent words into low-dimensional space. After the model is trained, similar words have similar low-dimensional representations. Alternatively, the state of hidden layer can depend on the most recent word and the state in the previous time step. Thus, the time is not represented explicitly. This recurrence allows the hidden layer to represent low-dimensional representation of the entire history (or in other words, it provides the model with a memory). The architecture is called the Recurrent neural network based language model (RNN LM)<ref name=MiT1><br />
Mikolov, Tomas, ''et al'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611 "Extensions of recurrent neural network language model."] in Acoustics, Speech and Signal Processing (ICASSP), (2011).<br />
</ref> <ref name=MiT2> Mikolov, Tomas, ''et al'' [http://www.fit.vutbr.cz/~imikolov/rnnlm/is2011_emp.pdf "Empirical evaluation and combination of advanced language modeling techniques"] in Proceedings of Interspeech, (2010). </ref>.<br />
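The recurrence described above can be sketched in a few lines of numpy. This is a hedged illustration, not the authors' implementation: the sigmoid hidden layer and softmax output follow the standard RNN LM formulation, and all sizes and weights are toy values.<br />

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm_step(word_onehot, s_prev, U, W, V):
    """One RNN LM time step: new hidden state and P(next word | history)."""
    s = 1.0 / (1.0 + np.exp(-(U @ word_onehot + W @ s_prev)))  # sigmoid hidden state
    return s, softmax(V @ s)

rng = np.random.default_rng(0)
vocab_size, hidden = 5, 3                    # toy sizes
U = rng.normal(size=(hidden, vocab_size))    # input -> hidden
W = rng.normal(size=(hidden, hidden))        # hidden -> hidden (the recurrence)
V = rng.normal(size=(vocab_size, hidden))    # hidden -> output
s = np.zeros(hidden)                         # empty history
for w in [0, 2, 1]:                          # feed a short word sequence
    s, p = rnn_lm_step(np.eye(vocab_size)[w], s, U, W, V)
print(round(p.sum(), 6))                     # a valid distribution over the vocabulary
```

Because the hidden state is reused at every step, the model's memory of the history lives entirely in the vector s, which is the point made in the text.<br />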
<br />
[[File:Fig.jpg |center]]<br />
Feedforward neural network 4-gram model (on the left) and Recurrent neural network language model (on the right) <br />
<br />
<br />
== Maximum Entropy model ==<br />
A maximum entropy model has the following form:<br />
<br />
<math>P(w|h)=\frac{e^{\sum_{i=1}^N \lambda_i f_i(h,w)}} {\sum_{w'} e^{\sum_{i=1}^N \lambda_i f_i(h,w')}}</math><br />
<br />
where h is the history, and f is the set of features, which in the maximum entropy case are n-grams. The choice of features is usually done manually, and it significantly affects the overall performance of the model. Training the maximum entropy model consists of learning the set of weights λ.<br />
<br />
<br />
== Computational complexity ==<br />
The training time of an N-gram feedforward neural network language model is proportional to:<br />
<br />
<math>\,I*W*((N-1) *D*H+H*V)</math><br />
<br />
where I is the number of training epochs before convergence is achieved, W is the number of tokens in the training set, N is the N-gram order, D is the dimensionality of words in the low-dimensional space, H is the size of the hidden layer, and V is the size of the vocabulary.<br />
<br />
The recurrent NN LM has computational complexity:<br />
<br />
<math>\,I*W*(H*H+H*V)</math><br />
<br />
It can be seen that by increasing the order N, the complexity of the feedforward architecture increases linearly, while it remains constant for the recurrent one.<br />
<br />
The computational complexity of the maximum entropy model is:<br />
<br />
<math>\,I*W*(N*V)</math><br />
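As a sanity check on these formulas, a small sketch (with hypothetical parameter values, not the paper's) makes the scaling behaviour explicit: the feedforward cost grows with N, while the recurrent cost does not depend on N at all.<br />

```python
def cost_ffnn(I, W, N, D, H, V):
    """Feedforward N-gram NN LM training cost: I*W*((N-1)*D*H + H*V)."""
    return I * W * ((N - 1) * D * H + H * V)

def cost_rnn(I, W, H, V):
    """Recurrent NN LM training cost: I*W*(H*H + H*V); independent of N."""
    return I * W * (H * H + H * V)

def cost_me(I, W, N, V):
    """Maximum entropy model training cost: I*W*(N*V)."""
    return I * W * (N * V)

# Hypothetical configuration, chosen only to illustrate the scaling:
I, W, D, H, V = 10, 4_000_000, 100, 200, 50_000
for N in (3, 5):
    print(N, cost_ffnn(I, W, N, D, H, V), cost_rnn(I, W, H, V), cost_me(I, W, N, V))
```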
<br />
The simple techniques used in the present study to reduce the computational complexity are:<br />
<br />
<br />
== A. Reduction of training epochs ==<br />
Training is usually performed by stochastic gradient descent and takes 10-50 training epochs to converge. <br />
In this study, it is demonstrated that good performance can be achieved with as few as 7 training epochs. This is achieved by sorting the training data by complexity. <br />
<br />
<br />
== B. Reduction of number of training tokens ==<br />
In a vast majority of cases, NN LMs for LVCSR tasks are trained on 5-30M tokens. Although the subsampling trick can be used to claim that the neural network model has seen all training data at least once, simple subsampling techniques lead to severe performance degradation compared to a model trained on all of the data.<br />
<br />
In this study, NN LMs are trained only on a small part of the data (which are in-domain corpora) plus some randomly subsampled part of out-of-domain data. <br />
<br />
<br />
== C. Reduction of vocabulary ==<br />
One technique is to compute the probability distribution only for the top M words in the neural network model, and to use backoff n-gram probabilities for the rest of the words. The list of top M words is called a shortlist. However, it was shown that this technique causes severe degradation of performance for small values of M, and even with M = 2000, the complexity of the H × V term is still significant.<br />
Goodman’s trick can be used to speed up the models with respect to the vocabulary. Each word from the vocabulary is assigned to a class, and only the probability distribution over classes, together with the distribution over the words within a single class, is computed. As the number of classes can be very small (several hundred), this is a more effective solution than using shortlists, and the performance degradation is smaller. <br />
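The saving from Goodman's trick can be sketched as a back-of-the-envelope cost comparison. The formula below is a simplified model of the output-layer cost, assuming words are spread evenly over C classes; the sizes match the largest model mentioned in the conclusions (640 hidden neurons, 84K-word vocabulary).<br />

```python
def output_cost(H, V, C=None):
    """Per-word output-layer cost: full softmax is H*V; with word classes it is
    H*(C + V/C), i.e. one softmax over classes plus one over words in a class."""
    if C is None:
        return H * V
    return H * (C + V // C)

V, H = 84_000, 640
full = output_cost(H, V)
classed = output_cost(H, V, C=300)   # a few hundred classes
print(full // classed)               # → 144, the output-layer speedup factor
```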
<br />
<br />
== D. Reduction of the size of the hidden layer ==<br />
<br />
Another way to reduce the H × V term is to choose a small value of H. Techniques for combining the NN model with other methods are introduced for choosing a proper size of the hidden layer.<br />
<br />
== E. Parallelization ==<br />
<br />
As the state of the hidden layer depends on the previous state, recurrent networks are hard to parallelize. One option is to parallelize just the computation between the hidden and output layers; another is to parallelize the whole network by training from multiple points in the training data at the same time. However, parallelization is a highly architecture-specific optimization problem, and the current study instead deals with computational cost through algorithmic approaches.<br />
<br />
<br />
<br />
== Automatic data selection and sorting ==<br />
<br />
The full training set is divided into 560 equally-sized chunks, and the perplexity on the development data is computed on each chunk. The data chunks with perplexity above 600 are discarded to obtain the reduced sorted training set.<br />
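The selection step can be sketched as follows. The perplexity scorer here is a stand-in for the paper's language model, and the descending-perplexity ordering (most out-of-domain first, in-domain last) is one plausible reading of "sorted"; both are assumptions, not details taken from the paper.<br />

```python
def select_and_sort(chunks, perplexity, threshold=600):
    """Keep chunks whose development-data perplexity is at most the threshold,
    ordered with the highest-perplexity (most out-of-domain) chunks first."""
    kept = [(perplexity(c), c) for c in chunks if perplexity(c) <= threshold]
    kept.sort(key=lambda pc: pc[0], reverse=True)
    return [c for _, c in kept]

chunks = ["in-domain text", "related text", "noisy out-of-domain text"]
fake_ppl = {"in-domain text": 120, "related text": 340, "noisy out-of-domain text": 910}
print(select_and_sort(chunks, fake_ppl.get))   # the noisy chunk is discarded
```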
<br />
[[File:fig2.jpg | center]]<br />
== Experiment with large RNN models ==<br />
<br />
By training the RNN model on the reduced, sorted data set and increasing the size of the hidden layer, better results than the baseline backoff model are obtained; the performance of RNN models is strongly correlated with the size of the hidden layer. Combining the RNN models with the baseline 4-gram model and tuning the weights of the individual models on the development set leads to a quite impressive reduction of WER.<br />
<br />
[[File:table.jpg | center]]<br />
<br />
<br />
== Hash-based implementation of class-based maximum entropy model ==<br />
<br />
The maximum entropy model can be seen in the context of neural network models as a weight matrix that directly connects the input and output layers. In the present study, direct connections are added to the class-based RNN architecture. Direct parameters are used to connect input and output layers, and input and class layers. This model is denoted as RNNME. <br />
<br />
Using direct connections leads to problems in memory complexity. To avoid this, a hash function is used to map the huge sparse weight matrix into a one-dimensional array. With this method, the achieved perplexity is better than the baseline perplexity of the KN4 model. Even better results are obtained after interpolating both models, and in the rescoring experiments.<br />
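A minimal sketch of this hash trick, assuming a generic hash function and a toy array size; the real implementation details (hash choice, array size, training procedure) are not specified here. Hash collisions between different n-gram features are simply tolerated as noise.<br />

```python
class HashedME:
    """Direct (maximum-entropy-style) connections stored via the hash trick:
    the huge sparse n-gram weight matrix becomes one fixed-size 1-D array."""

    def __init__(self, size):
        self.weights = [0.0] * size

    def _slot(self, history, word):
        # Map a (history, word) feature into the array; collisions are possible.
        return hash((tuple(history), word)) % len(self.weights)

    def score(self, history, word):
        return self.weights[self._slot(history, word)]

    def update(self, history, word, delta):
        self.weights[self._slot(history, word)] += delta

me = HashedME(size=1024)
me.update(["the"], "cat", 0.5)    # a gradient-style weight update
print(me.score(["the"], "cat"))   # → 0.5
```

The array size trades memory for accuracy: a smaller array saves memory but produces more collisions.<br />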
<br />
== Conclusions ==<br />
<br />
Some of the contributions and demonstrations in the paper are as follows:<br />
<br />
* Construction of the largest neural-network-based language models trained to date, using 400M training tokens, 640 hidden-layer neurons, and an 84K-word vocabulary<br />
* Showing that discarding and sorting parts of the training data can reduce perplexity by about 10%, and possibly more<br />
* Reducing the relative word error rate (WER) by almost 11%<br />
* Showing that an RNN model with direct connections does not need very large hidden layers in order to perform well <br />
<br />
<br />
== References ==<br />
<references /></div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=27752learning Fast Approximations of Sparse Coding2017-08-30T13:46:34Z<p>Conversion script: Conversion script moved page Learning Fast Approximations of Sparse Coding to learning Fast Approximations of Sparse Coding: Converting page titles to lowercase</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the desire to produce accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which utilizes these approximation schemes, giving a new procedure which is demonstrated to be roughly 10 times more efficient than the previous state-of-the-art approximation in empirical testing. The main contribution of this paper is a highly efficient learning-based method that computes good approximations of optimal sparse codes in a fixed amount of time.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \in \mathbb{R}^n </math>, we seek a new representation <math> Z \in \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \in \mathbb{R}^{n \times m} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
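This energy can be transcribed directly into numpy, with toy dimensions. Note that the zero code pays no sparsity penalty but incurs the full reconstruction error, while a dense least-squares code reconstructs well but pays the L1 penalty; the optimal code balances the two.<br />

```python
import numpy as np

def energy(X, Z, Wd, alpha):
    """E(X, Z) = 0.5 * ||X - Wd Z||_2^2 + alpha * ||Z||_1."""
    return 0.5 * np.sum((X - Wd @ Z) ** 2) + alpha * np.sum(np.abs(Z))

rng = np.random.default_rng(1)
Wd = rng.normal(size=(4, 8))       # n = 4 inputs, m = 8 (overcomplete) code
X = rng.normal(size=4)
Z_zero = np.zeros(8)               # no code: full reconstruction error, zero penalty
Z_dense = np.linalg.pinv(Wd) @ X   # exact reconstruction, but pays the L1 penalty
print(energy(X, Z_zero, Wd, 0.5), energy(X, Z_dense, Wd, 0.5))
```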
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
= Pre-existing Approximations: Iterative Shrinkage Algorithms =<br />
<br />
Here, baseline iterative shrinkage algorithms for finding sparse codes are introduced and explained. The ISTA and FISTA methods update the whole code vector in parallel, while the more efficient Coordinate Descent (CoD) method updates the components one at a time, carefully selecting which component to update at each step.<br />
Both methods refine the initial guess through a form of mutual inhibition between code components, and component-wise shrinkage.<br />
<br />
<br />
== Iterative Shrinkage & Thresholding (ISTA) ==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> \, L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = \mathrm{sign}(V_i)\max(|V_i| - \theta_i, 0) </math>, where <math> \theta \in \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
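The update rule (**) can be sketched directly in numpy. This is a hedged illustration with a toy random dictionary; the fixed iteration count replaces a proper convergence test.<br />

```python
import numpy as np

def soft_threshold(v, theta):
    """Component-wise shrinkage h_theta: sign(v) * max(|v| - theta, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def ista(X, Wd, alpha, n_iter=100):
    L = np.linalg.eigvalsh(Wd.T @ Wd).max()     # bound on the eigenvalues
    We = Wd.T / L                               # filter matrix
    S = np.eye(Wd.shape[1]) - (Wd.T @ Wd) / L   # mutual-inhibition matrix
    Z = np.zeros(Wd.shape[1])
    for _ in range(n_iter):
        Z = soft_threshold(We @ X + S @ Z, alpha / L)
    return Z

rng = np.random.default_rng(0)
Wd = rng.normal(size=(10, 20))
Wd /= np.linalg.norm(Wd, axis=0)                # normalized dictionary columns
Z_true = np.zeros(20); Z_true[[2, 7]] = [1.5, -2.0]
X = Wd @ Z_true                                 # signal generated by a 2-sparse code
Z = ista(X, Wd, alpha=0.1)
print(int(np.sum(np.abs(Z) > 1e-3)))            # only a few components survive
```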
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda \left(h_{\theta}(Z^{(k-1)}) - h_{\theta}(Z^{(k - 2)})\right) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference between the outputs of the shrinkage function for the preceding two iterations. This second term captures the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent (CoD) adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
The CoD algorithm is presented below:<br />
<br />
<blockquote><br />
<math>\textbf{function} \, \textbf{CoD}\left(X, Z, W_d, S, \alpha\right)</math><br />
: <math>\textbf{Require:} \,S = I - W_d^T W_d</math><br />
: <math>\textbf{Initialize:} \,Z = 0; B = W_d^TX</math><br />
: <math> \textbf{repeat}</math><br />
:: <math>\bar{Z} = h_{\alpha}\left(B\right)</math><br />
:: <math> \,k = \mbox{ index of largest component of} \left|Z - \bar{Z}\right|</math><br />
:: <math> \forall j \in \left[1, m\right]: B_j = B_j + S_{jk}\left(\bar{Z}_k - Z_k\right)</math><br />
:: <math> Z_k = \bar{Z}_k</math><br />
: <math>\textbf{until}\,\text{change in}\,Z\,\text{is below a threshold}</math> <br />
: <math> Z = h_{\alpha}\left(B\right)</math><br />
<math> \textbf{end} \, \textbf{function} </math><br />
</blockquote><br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change relative to the current code. This search step takes <math> \, O(m) </math> operations, and, accounting also for each component-wise optimization performed (which follows a process similar to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. This algorithm has a feedback concept similar to ISTA, but it can be expressed as a linear feedback operation with a very sparse matrix (since only one component is updated at a time). Either way, it turns out that, when both are deployed for approximately equal amounts of time, Coordinate Descent out-performs the ISTA methods in its approximation to an optimal code.<br />
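The pseudocode above translates almost line-for-line into numpy. This sketch uses a toy problem, and a fixed iteration count stands in for the convergence threshold.<br />

```python
import numpy as np

def soft_threshold(v, alpha):
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

def cod(X, Wd, alpha, n_iter=50):
    S = np.eye(Wd.shape[1]) - Wd.T @ Wd     # S = I - Wd^T Wd
    B = Wd.T @ X
    Z = np.zeros(Wd.shape[1])
    for _ in range(n_iter):
        Zbar = soft_threshold(B, alpha)
        k = np.argmax(np.abs(Z - Zbar))     # component whose update changes most
        B += S[:, k] * (Zbar[k] - Z[k])     # propagate the change through S
        Z[k] = Zbar[k]
    return soft_threshold(B, alpha)

rng = np.random.default_rng(0)
Wd = rng.normal(size=(10, 20))
Wd /= np.linalg.norm(Wd, axis=0)
X = Wd @ np.eye(20)[2] * 1.5                # signal generated by a 1-sparse code
Z = cod(X, Wd, alpha=0.1)
print(int(np.argmax(np.abs(Z))))            # the dominant code component
```

Since only one component changes per iteration, the feedback through S touches a single column, which is what makes the per-iteration cost low.<br />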
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \in \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
== A Simplistic Architecture and its Limitations ==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
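A sketch of the LISTA forward pass: the same recursion as ISTA, truncated to T steps, with W_e, S, and θ treated as free parameters shared across steps. Here they are merely initialized at their fixed ISTA values; the SGD / backprop-through-time training loop described above is omitted.<br />

```python
import numpy as np

def soft_threshold(v, theta):
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lista_forward(X, We, S, theta, T=3):
    """T unrolled ISTA steps with shared, learnable We, S, theta."""
    Z = soft_threshold(We @ X, theta)
    for _ in range(T - 1):
        Z = soft_threshold(We @ X + S @ Z, theta)
    return Z

rng = np.random.default_rng(0)
Wd = rng.normal(size=(10, 20))
Wd /= np.linalg.norm(Wd, axis=0)
L = np.linalg.eigvalsh(Wd.T @ Wd).max()
# Initialize the learnable parameters at their fixed ISTA values:
We, S, theta = Wd.T / L, np.eye(20) - (Wd.T @ Wd) / L, 0.1 / L
Z = lista_forward(rng.normal(size=10), We, S, theta)
print(Z.shape)   # a code prediction after only T = 3 "iterations"
```

Training then adjusts We, S, and theta so that these few steps predict the optimal codes as well as possible, which is why so few iterations suffice.<br />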
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity), iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogously to the procedure for LISTA, except for the technicality that sub-gradients are propagated, owing to the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
The algorithm for LCoD can be summarized as <br />
<br />
<br />
[[File:Q12.png]]<br />
<br />
<br />
A main advantage of the system proposed in this paper is speed, so it is necessary to take note of the asymptotic complexity of the above algorithm: only <math>\, O(m)</math> operations are required for each step of the bprop procedure, and each iteration only requires <math>\, O(m)</math> space; as almost all of the stored variables are scalar, with the exception of <math>\, B(T)</math>. (Recall that m refers to the number of dimensions in the new feature space with the sparse representations.)<br />
<br />
= Empirical Performance =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yields superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their CoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their CoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, as well as extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error across all procedures, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations. <br />
<br />
A complete feature vector consisted of 25 such concatenated vectors, extracted from all 16 × 16 patches shifted by 3 pixels on the input. The features were extracted for all digits using CoD with exact inference, CoD with a fixed number of iterations, and LCoD. Additionally, a version of CoD (denoted CoD’) used inference with a fixed number of iterations during training of the filters, and the same number of iterations at test time (the same complexity as LCoD). A logistic regression classifier was trained on the features thereby obtained.<br />
<br />
Classification errors on the test set are shown in the following figures. While the error rate decreases with the number of iterations for all methods, the error rate of LCoD with 10 iterations is very close to the optimum (differences in error rates of less than 0.1% are insignificant on MNIST).<br />
<br />
[[File:T1.png]]<br />
<br />
MNIST results with 784-D sparse codes<br />
<br />
[[File:T2.png]]<br />
<br />
MNIST results with 25 256-D sparse codes extracted from 16 × 16 patches every 3 pixels<br />
<br />
= Conclusions =<br />
<br />
The idea of time-unfolding an inference algorithm in order to construct a fixed-depth network, in application to sparse coding, is introduced in this paper. In sparse coding, inference algorithms are iterative and converge to a fixed point; here it is proposed to unroll an inference algorithm for a fixed number of iterations in order to define an approximator network. The main result of this paper is the demonstration that the number of iterations required to reach a given code-prediction error can be heavily reduced, by a factor of about 20, by learning the filters and mutual-inhibition matrices of truncated FISTA and CoD. In other words, not much data-specific mutual inhibition is required to handle the phenomenon of "explaining away" superfluous parts of the code vector.<br />
<br />
</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Neural_Nets_as_a_Method_for_Quantitative_Structure%E2%80%93Activity_Relationships&diff=27754deep Neural Nets as a Method for Quantitative Structure–Activity Relationships2017-08-30T13:46:34Z<p>Conversion script: Conversion script moved page Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships to deep Neural Nets as a Method for Quantitative Structure–Activity Relationships: Converting page titles to lowercase</p>
<hr />
<div>== Introduction ==<br />
This abstract is a summary of the paper "Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships" by Ma J. et al. <ref> Ma J, Sheridan R. et al. [ http://pubs.acs.org/doi/pdf/10.1021/ci500747n.pdf "QSAR deep nets"] Journal of Chemical Information and Modeling. 2015,55, 263-274</ref>. The paper presents the application of machine learning methods, specifically Deep Neural Networks <ref> Hinton, G. E.; Osindero, S.; Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation 2006, 18, 1527−1554</ref> and Random Forest models <ref> Breiman L. Random Forests, Machine Learning. 2001,45, 5-32</ref> in the field of pharmaceutical industry. To discover a drug, it is needed that the best combination of different chemical compounds with different molecular structure was selected in order to achieve the best biological activity. Currently the SAR (QSAR) models are routinely used for this purpose. Structure-Activity Relationship (SAR), or Quantified SAR, is an approach designed to find relationships between chemical structure and biological activity (or target property) of studied compounds. The SAR models are type of classification or regression models where the predictors consist of physio-chemical properties or theoretical molecular and the response variable could be a biological activity of the chemicals, such as concentration of a substance required to give a certain biological response. The basic idea behind these methods is that activity of molecules is reflected in their structure and same molecules have the same activity. So if we learn the activity of a set of molecules structures ( or combinations of molecules) then we can predict the activity of similar molecules. 
QSAR methods can be particularly computationally intensive or require the adjustment of many sensitive parameters to achieve good predictions. In this sense, machine learning methods can be helpful, and two of them, support vector machines (SVM) and random forests (RF), are commonly used <ref>Svetnik, V. et al. [http://pubs.acs.org/doi/pdf/10.1021/ci034160g.pdf Random forest: a classification and regression tool for compound classification and QSAR modeling]. J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958</ref>. In this paper the authors investigate the prediction performance of DNNs as a QSAR method and compare it with that of RF, which is widely considered the gold standard in this field. <br />
<br />
<br />
== Motivation ==<br />
At the first stage of drug discovery there is a huge number of candidate compounds that can be combined to produce a new drug. This process may involve a large number of compounds (>100,000) and a large number of descriptors (several thousand) with different biological activities. Predicting all biological activities for all compounds experimentally would require an enormous number of experiments. In silico discovery using optimization algorithms can substantially reduce the experimental work that needs to be done. It was hypothesized that DNN models would outperform RF models. <br />
<br />
== Methods ==<br />
To compare the prediction performance of the two methods, DNN and RF models were fitted to 15 data sets from the pharmaceutical company Merck. The smallest data set has 2092 molecules with 4596 unique AP and DP descriptors. Each molecule is represented by a list of features, i.e. “descriptors” in QSAR nomenclature. The descriptors are substructure descriptors such as atom pairs (AP) (other examples include MACCS keys and circular fingerprints) and donor−acceptor pairs (DP). Both descriptor types are of the following form:<br />
<br />
atom type i − (distance in bonds) − atom type j<br />
<br />
where for AP, the atom type includes the element, the number of non-hydrogen neighbors, and the number of pi electrons, while for DP the atom type is one of seven (cation, anion, neutral donor, neutral acceptor, polar, hydrophobe, and other). A separate group of 15 data sets, the additional data sets, was used to validate the conclusions drawn from the Kaggle data sets. Each data set was split into a training and a test set. The metric used to evaluate the prediction performance of the methods is the coefficient of determination (<math>R^2</math>). <br />
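As a concrete reference, the evaluation metric can be sketched in plain Python. Here <math>R^2</math> is implemented as the squared Pearson correlation between observed and predicted activities, a common convention in QSAR work; whether the paper uses exactly this definition is an assumption of this sketch.

```python
def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted activities."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)
```

Under this convention a perfectly linear relationship between predictions and observations gives <math>R^2 = 1</math>, regardless of scale or offset.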
<br />
To run RF, 100 trees were generated with m/3 descriptors used at each branch point, where m is the number of unique descriptors in the training set. Tree nodes with 5 or fewer molecules were not split further. The trees were parallelized, one tree per processor on a cluster, so that larger data sets could be run in a reasonable time.<br />
<br />
DNNs with input descriptors <math>X</math> of a molecule, where each neuron computes an output of the form <math>O=f(\sum_{i=1}^{N} w_ix_i+b)</math>, were fitted to the data sets. Since many different parameters, such as the number of layers and the number of neurons, influence the performance of a deep neural net, Ma and his colleagues performed a sensitivity analysis. They trained 71 DNNs with different parameter settings for each data set. The parameters they considered relate to: <br />
<br />
-Data (descriptor transformation: no transformation, logarithmic transformation, or binary transformation).<br />
<br />
-Network architecture: number of hidden layers, number of neurons in each hidden layer.<br />
<br />
-Activation functions: sigmoid or rectified linear unit.<br />
<br />
-The DNN training strategy: single training set or joint from multiple sets, percentage of neurons to drop-out in each layer.<br />
<br />
-The mini-batched stochastic gradient descent procedure in the BP algorithm: minibatch size and number of epochs.<br />
<br />
-Controls on the gradient descent optimization procedure: learning rate, momentum strength, and weight cost strength.<br />
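The sensitivity analysis amounts to enumerating combinations of these adjustable parameters and training one DNN per combination. A minimal sketch of such an enumeration (the specific values listed below are illustrative assumptions, not the authors' exact grid):

```python
from itertools import product

# Illustrative subsets of the adjustable parameters (hypothetical values).
grid = {
    "transform":  ["none", "log", "binary"],
    "hidden":     [(1000,), (2000, 1000), (4000, 2000, 1000, 1000)],
    "activation": ["sigmoid", "relu"],
    "dropout":    [0.0, 0.25],
}

# One dict per candidate DNN configuration.
settings = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(settings))  # 3 * 3 * 2 * 2 = 36 configurations
```

In practice the authors adjusted only one or two parameters at a time rather than training the full cross-product, which keeps the number of runs (71 per data set) manageable.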
<br />
In addition to the effect of these parameters on the DNNs, the authors were interested in evaluating the consistency of results across a diverse set of QSAR tasks. Because evaluating the effect of the large number of adjustable parameters is time-consuming, a reasonable number of parameter settings was selected by adjusting the values of one or two parameters at a time and then calculating <math>R^2</math> for the DNNs trained with the selected settings. These results allowed them to focus on a smaller number of parameters and finally generate a set of recommended values for all algorithmic parameters that leads to consistently good predictions. <br />
<br />
=== Regularization ===<br />
<br />
A very common problem with deep neural networks is overfitting, as the number of weights grows rapidly with more layers and nodes. The researchers considered two methods for this issue: dropout, which was described in a previous summary, and pre-training.<br />
<br />
The general method for pre-training goes as follows:<br />
<br />
1. Break down the deep neural network into its subsequent layers.<br />
<br />
2. For each layer, take the input (either the data or the previous layer's output) and train the layer to project the input in a way that captures the maximum amount of variation, similar to dimension-reduction techniques such as PCA. This is usually done with either auto-encoders, which encode the input in a lower dimension, or restricted Boltzmann machines.<br />
<br />
3. After each layer has been trained this way, the parameters of the model are now initialized with some set of weights that depend on the data.<br />
<br />
The regularizing effect of this works as follows. Consider the surface of the objective function as a function of the weights; due to the complexity of neural networks, this surface varies significantly and contains many local minima. Gradient descent tends to get trapped in local minima, and starting from random weights it can be difficult to reach a good one. The hope is that, by training the deep neural network to capture almost all of the variation in the data, the resulting set of weights will be near a good local minimum, from which gradient descent can then calibrate towards a good solution. This is similar to the idea of combining PCA with some other classifier: first map the points to a subspace in which they are easily linearly separable, and then the classifier can easily classify them. Equivalently, once the first few layers project the points into an easily separable subspace, subsequent layers in the network can work on classifying these projected points. If the pre-trained weights are near a local minimum, gradient descent heavily restricts their range of values, since it travels towards the minimum immediately, and this restriction acts as a regularizer on the whole neural network.<br />
<br />
However, when the researchers tried this with some modifications to accommodate their code, it did not improve results.<br />
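The greedy layer-wise procedure described above can be sketched as follows. Here `fit_layer` is a placeholder for whatever single-layer learner is used (an auto-encoder or an RBM); in this sketch it is stubbed with random weights, so only the stacking structure is shown.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def encode(W, x):
    """One layer's deterministic encoding of input vector x."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]

def fit_layer(inputs, n_hidden):
    """Placeholder for a single-layer learner (auto-encoder or RBM fit).
    Returns an n_hidden x n_visible weight matrix."""
    n_visible = len(inputs[0])
    return [[random.gauss(0.0, 0.1) for _ in range(n_visible)]
            for _ in range(n_hidden)]

def pretrain_greedy(data, layer_sizes):
    """Train each layer on the previous layer's outputs, then stack the
    learned encoders as the initial weights of the deep network."""
    weights, inputs = [], data
    for n_hidden in layer_sizes:
        W = fit_layer(inputs, n_hidden)
        weights.append(W)
        inputs = [encode(W, x) for x in inputs]
    return weights
```

The returned `weights` would then serve as the data-dependent initialization for supervised fine-tuning with gradient descent.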
<br />
== Results ==<br />
<br />
For the first objective of this paper, comparing the performance of DNNs to RF, over 50 DNNs were trained using different parameter settings. These settings were arbitrarily selected, but they attempted to cover a sufficient range of values for each adjustable parameter. Figure 1 shows the difference in <math>R^2</math> between DNNs and RF for each Kaggle data set. Each column represents a QSAR data set, and each circle represents the improvement of a DNN over RF.<br />
<br />
<br />
<center><br />
[[File: fig1.PNG | frame | center |Figure 1. Overall DNN vs RF using arbitrarily selected parameter values. Each column represents a QSAR data set, and each circle represents the<br />
improvement, measured in <math>R^2</math>, of a DNN over RF ]]<br />
</center><br />
<br />
Comparing the performance of the different models shows that even when the worst DNN parameter setting was used for each QSAR task, the average <math>R^2</math> would be degraded only from 0.423 to 0.412, merely a 2.6% reduction. These results suggest that DNNs can generally outperform RF (see the table below).<br />
<br />
<br />
<center><br />
[[File: table1.PNG | frame | center |Table 1. Comparing the test <math>R^2</math> of different models ]]<br />
</center><br />
<br />
The difference in <math>R^2</math> between DNN and RF as the network architecture changes is shown in Figure 2. To limit the number of parameter combinations, the authors fixed the number of neurons to be the same in each hidden layer. Thirty-two DNNs were trained for each data set by varying the number of hidden layers and the number of neurons per layer, while the other key adjustable parameters were kept unchanged. When the number of hidden layers is two, a small number of neurons per layer degrades the predictive capability of the DNNs. It can also be seen that, for any number of hidden layers, once the number of neurons per layer is sufficiently large, increasing it further has only a marginal benefit. Figure 2 also shows that a neural network with only one hidden layer of 12 neurons achieved the same average predictive capability as RF; this network size is comparable to that of the classical neural networks used in QSAR.<br />
<br />
<center><br />
[[File: fig2.PNG | frame | center |Figure 2. Impacts of network architecture. Each marker in the plot represents a choice of DNN network architecture. Markers sharing the same number of hidden layers are connected with a line. The measurement (y-axis) is the difference in mean <math>R^2</math> between DNNs and RF. ]]<br />
</center><br />
<br />
To decide which activation function, Sigmoid or ReLU, performs better, at least 15 pairs of DNNs were trained for each data set. Each pair shared the same adjustable parameter settings, except that one DNN used ReLU as the activation function while the other used the Sigmoid function. The data sets where ReLU is significantly better than Sigmoid are colored in blue and marked at the bottom with “+”s; the difference was tested by a one-sample Wilcoxon test. In contrast, the data sets where Sigmoid is significantly better than ReLU are colored in black and marked at the bottom with “−”s (Figure 3). In 53.3% (8 out of 15) of the data sets, ReLU is statistically significantly better than Sigmoid. Overall, ReLU improves the average <math>R^2</math> over Sigmoid by 0.016. <br />
<br />
<center><br />
[[File: fig3.PNG | frame | center |Figure 3. Choice of activation functions. Each column represents a QSAR data set, and each circle represents the difference, measured in <math>R^2</math>, of a pair of<br />
DNNs trained with ReLU and Sigmoid, respectively ]]<br />
</center><br />
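The paired comparison above rests on the Wilcoxon signed-rank test applied to per-pair <math>R^2</math> differences. The test statistic itself is simple to sketch; this is a bare-bones version (no tie handling, no p-value) — in practice one would use a statistics library.

```python
def wilcoxon_w_plus(differences):
    """Sum of the ranks of the positive differences (the W+ statistic).
    Zero differences are dropped; ties in |d| are not handled here."""
    d = [x for x in differences if x != 0]
    # Rank the differences by absolute magnitude (rank 1 = smallest |d|).
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0] * len(d)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return sum(r for r, x in zip(ranks, d) if x > 0)
```

A large W+ relative to its null distribution indicates that the first member of each pair (here, the ReLU network) systematically outperforms the second.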
<br />
Figure 4 presents the difference between joint DNNs trained on multiple data sets and individual DNNs trained on single data sets. Averaged over all data sets, the joint DNNs seem to perform better than individual training. However, the size of the training sets plays a critical role in whether a joint DNN is beneficial: for the two largest data sets (3A4 and LOGD), the individual DNNs seem better, indicating that joint DNNs are more appropriate for data sets that are not very large. <br />
<br />
<center><br />
[[File: fig4.PNG | frame | center |Figure 4. Difference between joint DNNs trained with multiple data sets and individual DNNs trained with single data sets ]]<br />
</center><br />
<br />
The authors refined their selection of DNN adjustable parameters by studying the results of the previous runs. They used the logarithmic transformation, two hidden layers, at least 250 neurons in each hidden layer, and the ReLU activation function. The results are shown in Figure 5. Comparing these results with those in Figure 1 indicates that there are now 9 out of 15 data sets where DNNs outperform RF even with the “worst” parameter setting, compared with 4 out of 15 before. The <math>R^2</math> averaged over all DNNs and all 15 data sets is 0.051 higher than that of RF.<br />
<br />
<center><br />
[[File: fig5.PNG | frame | center |Figure 5. DNN vs RF with refined parameter settings ]]<br />
</center><br />
<br />
As a conclusion to the sensitivity analysis performed in this work, the authors recommended the following values for the adjustable parameters of DNNs:<br />
<br />
- Logarithmic transformation. <br />
<br />
- Four hidden layers, with 4000, 2000, 1000, and 1000 neurons, respectively.<br />
<br />
- Dropout rates of 0 in the input layer, 25% in the first 3 hidden layers, and 10% in the last hidden layer.<br />
<br />
- ReLU as the activation function.<br />
<br />
- No unsupervised pre-training; the network parameters should be initialized as random values.<br />
<br />
- A large number of epochs.<br />
<br />
- A learning rate of 0.05, momentum strength of 0.9, and weight cost strength of 0.0001.<br />
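The recommendations above can be collected into a single configuration sketch. The layer-dimension helper shows the weight-matrix shapes this architecture implies for the single-output regression case; using 4596 input descriptors (the descriptor count of the smallest data set mentioned earlier) is purely illustrative.

```python
# The recommended DNN settings from the paper, collected as a plain config.
RECOMMENDED = {
    "descriptor_transform": "logarithmic",
    "hidden_layers": [4000, 2000, 1000, 1000],
    "dropout": [0.0, 0.25, 0.25, 0.25, 0.10],  # input, first 3 hidden, last hidden
    "activation": "relu",
    "unsupervised_pretraining": False,
    "learning_rate": 0.05,
    "momentum": 0.9,
    "weight_cost": 0.0001,
}

def layer_dims(n_descriptors, hidden_layers, n_outputs=1):
    """Weight-matrix shapes implied by the architecture (in, out) per layer."""
    sizes = [n_descriptors] + list(hidden_layers) + [n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))
```

For example, `layer_dims(4596, RECOMMENDED["hidden_layers"])` yields five weight matrices, the largest of which (4596 × 4000) alone holds over 18 million parameters — which is why GPU training matters here.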
<br />
To check the consistency of DNN predictions, which was one of the authors' concerns, they compared the performance of RF with DNN on 15 additional QSAR data sets. Each additional data set was time-split into training and test sets in the same way as the Kaggle data sets. Individual DNNs were trained on the training sets using the recommended parameters, and the <math>R^2</math> of the DNN and RF were calculated on the test sets. The table below presents the results for the additional data sets. The DNN with recommended parameters outperforms RF in 13 out of the 15 additional data sets. The mean <math>R^2</math> of the DNNs is 0.411, while that of the RFs is 0.361, an improvement of 13.9%.<br />
<br />
<center><br />
[[File: table2.PNG | frame | center |Comparing RF with DNN trained using recommended parameter settings on 15 additional datasets]]<br />
</center><br />
<br />
Both RF and DNN can be efficiently sped up using high-performance computing technologies, but in different ways due to the inherent differences in their algorithms. RF can be accelerated using coarse parallelization on a cluster, with one tree per node. In contrast, DNN can efficiently exploit the parallel computation capability of a modern GPU. With the dramatic advances in GPU hardware and the increasing availability of GPU computing resources, DNN can become comparable, if not advantageous, to RF in various aspects, including ease of implementation, computation time, and hardware cost.<br />
<br />
== Discussion ==<br />
This paper demonstrates that DNN can in most cases be used as a practical QSAR method in place of RF, which is currently the gold standard in the field of drug discovery. Although the magnitude of the change in the coefficient of determination relative to RF is small for some data sets, on average DNN is better than RF. The paper recommends a set of values for all DNN algorithmic parameters that is appropriate for large QSAR data sets in an industrial drug-discovery environment. The authors also gave some recommendations on how RF and DNN can be efficiently sped up using high-performance computing technologies: RF can be accelerated using coarse parallelization on a cluster, with one tree per node, while DNN can efficiently exploit the parallel computation capability of a modern GPU. <br />
<br />
== Future Works ==<br />
<br />
Contrary to the expectation that unsupervised pre-training plays a critical role in the success of DNNs, in this study it had an adverse effect on the performance of QSAR tasks; this remains to be investigated further.<br />
Although the paper made some recommendations about the adjustable parameters of DNNs, an effective and efficient strategy for refining these parameters for each particular QSAR task still needs to be developed.<br />
The results of the current paper suggest that cross-validation failed to be effective for fine-tuning the algorithmic parameters. Therefore, instead of using automatic methods for tuning DNN parameters, new approaches that better indicate a DNN's predictive capability on a time-split test set need to be developed.<br />
<br />
== Bibliography ==<br />
<references /></div>

the loss surfaces of multilayer networks (Choromanska et al.)
<hr />
<div>= Overview =<br />
<br />
The paper ''Loss Surfaces of Multilayer Networks'' by Choromanska et al. is situated in the context of determining critical points (i.e. minima, maxima, or saddle points) of loss surfaces of deep multilayer network models, such as feedforward perceptrons.<br />
<br />
The authors present a model of multilayer rectified linear units (ReLUs), and show that it may be expressed as a polynomial function of the parameter matrices in the network, with a polynomial degree equal to the number of layers. The <span>ReLu</span> units produce a piecewise continuous polynomial, whose monomials switch between being active (nonzero) and inactive (zero) at the boundaries between pieces. With this model, they study the distribution of critical points of the loss polynomial, providing an analysis using results from random matrix theory applied to spherical spin glasses.<br />
<br />
The 3 key findings of this work are the following:<br />
<br />
* For large-size networks, most local minima are equivalent and yield similar performance on a test set.<br />
* The probability of finding a ''bad'' local minimum (i.e. one with a large value in terms of the loss function) may be large for small-size networks, but decreases quickly with network size.<br />
* Obtaining the global minimum of the loss function using a training dataset is not useful in practice.<br />
<br />
Many theoretical results are reported, which will not be exhaustively covered here. However, a high-level overview of proof techniques will be given, followed by a summary of the experimental results.<br />
<br />
= Prior Work =<br />
<br />
Earlier work has shown, for high-dimensional random Gaussian error functions, that critical points with error much higher than the global minimum are very likely to be saddle points (e.g. <ref>Bray, A. J., & Dean, D. S. (2007). Statistics of critical points of Gaussian fields on large-dimensional spaces. Physical review letters, 98(15), 150201.</ref>). Furthermore, all local minima are likely to be very close in functional value to the global minimum. Dauphin et al. <ref> Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). [http://papers.nips.cc/paper/5486-identifying-and-attacking-the-saddle-point-problem-in-high-dimensional-non-convex-optimization.pdf Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.] In Advances in Neural Information Processing Systems (pp. 2933-2941). </ref> empirically show that the cost functions of neural networks behave similarly to Gaussian error functions in high-dimensional spaces, but no theoretical justification is provided. This is one of the main contributions of this paper.<br />
<br />
In <ref>Auffinger, A., & Arous, G. B. (2013). Complexity of random smooth functions on the high-dimensional sphere. The Annals of Probability, 41(6), 4214-4247.</ref>, an asymptotic evaluation of the complexity of the spherical spin-glass model from condensed matter physics is provided. The authors found that the critical values with low Hamiltonian values have a layered structure that behaves like a Gaussian process. This work shows, under the assumptions listed in the overview, that the objective function used by a neural network is analogous to the Hamiltonian of the spin-glass problem. This means that they exhibit similar behaviour. This is not the first attempt at connecting the spin-glass problem with neural networks but none had attempted to optimize the neural network objective using the theory developed for the spin-glass problem. Thus, this paper is also novel in that respect. <br />
<br />
= Theoretical Analysis =<br />
<br />
Consider a simple fully-connected feed-forward deep network <math>\mathcal{N}</math> with a single output for a binary classification task. The authors use the convention that <math>(H-1)</math> denotes the number of hidden layers in the network (the input layer is the <math>0^{\text{th}}</math> layer and the output layer is the <math>H^{\text{th}}</math> layer). The input <math>X</math> is a vector with <math>d</math> elements, assumed to be random. The variable <math>n_i</math> denotes the number of units in the <math>i^{\text{th}}</math> layer (due to the network restrictions, <math>n_0 = d</math> and <math>n_H = 1</math>). Finally, <math>W_i</math> is the matrix of weights between the <math>(i-1)^{\text{th}}</math> and <math>i^{\text{th}}</math> layers of the network, and <math>\sigma(x) = \max(0,x)</math> is the activation function. For a random input <math>X</math>, the random network output <math>Y</math> is <math>Y = q\,\sigma(W_H^{\top}\sigma(W_{H-1}^{\top}\cdots\sigma(W_1^{\top}X)\cdots)),</math> where <math>q</math> is a normalization factor.<br />
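The output formula above can be sketched directly as a pure-Python forward pass; note that <math>\sigma</math> (ReLU) is applied at every layer, including the final one. The tiny example weights used in the test are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    """Multiply a weight matrix (one row per output unit) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def network_output(X, weights, q=1.0):
    """Y = q * sigma(W_H^T sigma(W_{H-1}^T ... sigma(W_1^T X))), sigma = ReLU."""
    h = X
    for W in weights:
        h = relu(matvec(W, h))
    return q * h[0]  # single output unit, since n_H = 1
```

Because ReLU either passes a value through or zeroes it, each active path through the network contributes the product of its weights, which is exactly the polynomial re-expression given below.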
<br />
The key assumption in the theoretical work is the following: for <span>ReLu</span> activation functions <math>\sigma(x)</math> applied to a random variable <math>x</math>, the output can be seen as being equal to <math>\delta \cdot x</math>, where <math>x</math> is a (not necessarily random) nonzero variable and <math>\delta</math> is a ''new'' random variable that is identically equal to either 0 or 1. With this in mind, the output of the network can be re-expressed as: <math>Y = q\sum_{i=1}^{n_0}X_{i}\sum_{j=1}^{\gamma} A_{i,j}\prod_{k=1}^{H}w_{i,j}^{(k)},</math><br />
<br />
where <math>A_{i,j}</math> is a random variable equal to 0 or 1, denoting whether path <math>(i,j)</math> is active (<math>A_{i,j} = 1</math>) or not (<math>A_{i,j} = 0</math>). In this expression, the first summation over <math>i</math> is over the elements of the network input vector, and the second summation over <math>j</math> is over all ''paths'' from <math>X_i</math> to the output; the upper limit of this second summation is <math>\gamma = n_1 n_2 \dots n_H</math>, the number of possible paths. The term <math>w_{i,j}^{(k)}</math> refers to the value of the parameter-matrix entry in the layer corresponding to the <math>k^{\text{th}}</math> segment of the path indexed by <math>(i,j)</math>; hence there are <math>H</math> terms <math>w_{i,j}^{(k)}</math> per path.<br />
<br />
From this equation, it can be seen that the output of the <span>ReLu</span> network is polynomial in the weight matrix parameters, and the treatment of <math>A_{i,j}</math> as a random indicator variable allows connections to be made with spin glass models.<br />
<br />
The remainder of the theoretical analysis proceeds as follows:<br />
<br />
<ul><br />
<li><p>The input vector <math>X</math> and all <math>\{A_{i,j}\}</math> are assumed to be random variables, where <math>A_{i,j}</math> is a Bernoulli random variable and all input elements of <math>X</math> are independent.</p></li><br />
<li><p>One further critical assumption is the spherical constraint; all parameter weights <math>w_i</math> (elements of the parameter matrices) satisfy a spherical bound:</p><br />
<p><math>\frac{1}{\Lambda}\sum_{i=1}^{\Lambda} w_i^2 = C</math></p><br />
<p>for some <math>C > 0</math> where <math>\Lambda</math> is the number of parameters.</p></li><br />
<li><p>These assumptions allow the network output to be modeled as a ''spherical spin glass model'', which is a physical model for magnetic dipoles in ferromagnetic materials (a dipole has a magnetization state that is a binary random variable)</p></li><br />
<li><p>Using this assumption, the work by Auffinger et al. in the field of random matrices and spin glasses is then used to relate the energy states of the system Hamiltonians of spin-glass models to the critical points of the neural network loss function.</p></li><br />
<li><p>The analysis shows that the critical points of the loss function correspond to different energy bands in the spin glass model; as in a physical system, higher energy states are less probable; while the number of states is infinite, the probability of the system appearing in that state vanishes.</p></li><br />
<li><p>The energy barrier <math>E_{\infty}</math> stems from this analysis, and is given by</p><br />
<p><math>E_{\infty} = E_{\infty}(H) = 2\sqrt{\frac{H-1}{H}}.</math> Auffinger et al. show that the critical values of the loss function correspond to energies below <math>-\Lambda E_{\infty}</math> if their critical band index (i.e. energy index) is finite.</p></li></ul><br />
<br />
= Experiments =<br />
<br />
The numerical experiments conducted were to verify the theoretical claims of the distribution of critical points around the energy bound <math>E_{\infty}</math>, as well as to correlate the testing and training loss for different numbers of parameters <math>(\Lambda)</math> in the models.<br />
<br />
== MNIST Experiments ==<br />
<br />
<span>ReLu</span> neural networks with a single layer and increasing <math>\Lambda \in \{25,50,100,250,500\}</math> were trained for multiclass classification on a scaled-down version of the MNIST digit dataset, where each image was downsampled to <math>10 \times 10</math> pixels. For each value of <math>\Lambda</math>, 200 epochs of SGD with a decaying learning rate were used to optimize the parameters of the network. The optimization experiments were performed 1000 times with different initial values for the weight parameters, drawn uniformly at random from <math>[-1,1]</math>.<br />
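The experimental protocol can be summarized as a loop over network sizes and random restarts. Here `train_network` is only a stub standing in for the 200-epoch SGD run on the downsampled MNIST data; the loss values it returns are simulated, not the paper's results.

```python
import random
import statistics

def train_network(n_hidden, seed):
    """Stub for a 200-epoch SGD training run: returns a final loss value.
    This toy stand-in merely simulates that larger networks reach lower,
    tighter losses; a real run would fit an MLP on 10x10 MNIST images."""
    rng = random.Random(seed)
    return 1.0 / n_hidden + rng.uniform(0.0, 1.0 / n_hidden)

def run_experiments(sizes=(25, 50, 100, 250, 500), restarts=1000):
    """For each network size, repeat training from random restarts and
    record the mean and spread of the final losses."""
    results = {}
    for n_hidden in sizes:
        losses = [train_network(n_hidden, seed) for seed in range(restarts)]
        results[n_hidden] = (statistics.mean(losses), statistics.pstdev(losses))
    return results
```

The quantities collected per size (mean loss and its spread across restarts) are exactly what Figures 1 and 2 below summarize for the real experiments.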
<br />
= Results =<br />
<br />
To evaluate the distribution of the energy states occupied by each critical point (i.e. solution) of the loss function, the eigenvalues of the Hamiltonian matrix of the loss function were computed at the parameter values reached after the optimization procedure completed. The distribution of the (normalized) index of the energy states is shown below in Fig 1. For all models, with different numbers of parameters, the occupied energy states are the low-energy bands.<br />
<center><br />
[[File:index_dist.png | frame | center |Fig 1. Distribution of normalized indices of energy states as computed from the system Hamiltonian at the final values of the parameters after the SGD optimization procedure completed.]]<br />
</center><br />
<br />
The final values of the loss function in these experiments are also shown in the histograms in Fig 2. Interestingly, the variance in the loss decreases with increasing numbers of parameters, despite the fact that the spread in the energy state (Fig. 1) increases. This shows that despite the fact that local minima are more prevalent for models with many parameters, there is no appreciable difference in the loss function at these minima: the minima are essentially all equally good in terms of minimizing the objective cost.<br />
<center><br />
[[File:loss_distribution.png | frame | center | Empirical distribution of the values of the loss function over the course of 1000 experiments with different numbers of parameters (Lambda). Each experimental run used a different random initialization of the parameter weights.]]<br />
</center><br />
<br />
Finally, a scatter plot of the training vs testing error for each model is shown in Fig. 3. It can be seen that the correlation between the two errors decreases as the number of parameters increases, suggesting that obtaining a global minimum would not necessarily produce better testing results (and hence still would have a sizeable generalization error).<br />
<center><br />
[[File:train_test_corr.png | frame | center |Fig 3. Scatter plots showing the correlation between training and testing error for the MNIST dataset experiments. For networks with few parameters, there is a very strong correlation between the two errors. For networks with many more parameters, the correlation weakens, suggesting that obtaining the optimal loss (critical point) in the training phase does not improve the generalization error. ]]<br />
</center><br />
<br />
<br />
<br />
=Discussion=<br />
==Power of Deep Neural Nets from the No Free Lunch Point View==<br />
A far-out view explaining why deep neural networks have a lower probability of bad local minima is Woodward's <ref>Woodward, John R. "GA or GP? That is not the question." Evolutionary Computation, 2003. CEC'03. The 2003 Congress on. Vol. 2. IEEE, 2003.</ref> paper on why the No Free Lunch Theorem (NFLT) doesn't hold. The NFLT states, roughly, that if one cannot incorporate domain-specific knowledge into a search or optimization algorithm, one cannot guarantee that it will outperform (in terms of convergence speed) any other search/optimization algorithm; this implies that there can be no universal search algorithm that is the best. <br />
<br />
Woodward's argument is that whether you use Genetic Algorithms or Genetic Programming does not matter; what matters is the solution mapping. Consider the task of [https://en.wikipedia.org/wiki/Symbolic_regression Symbolic Regression] with two algorithms <math>P</math> and <math>Q</math>, and let <math>P_{s} = \{+ba, a, b, +aa, +bb, +ab\}</math> and <math>Q_{s} = \{+ab, +ba, a, b, +aa, +bb\}</math> be the time-ordered solutions explored by <math>P</math> and <math>Q</math>. If the problem we face requires the solution <math>+ab</math>, then <math>Q</math> discovers it on its first try while <math>P</math> finds it last; however, for any other solution, algorithm <math>P</math> will always outperform <math>Q</math>.<br />
<br />
From the above we might conclude that for deep neural networks, the larger or deeper the network, the more likely its connections will be able to generate a function that minimizes the loss faster than a smaller network (since all networks were trained for 200 epochs; theoretically a single-layer MLP can approximate any function, but practically that could take forever), thus minimizing the chance of bad local minima (similar to how a complex function has a better chance of fitting data than a simpler one).<br />
<br />
= References =<br />
<references/></div>

a fast learning algorithm for deep belief nets
<hr />
<div>== Introduction ==<br />
<br />
The authors (Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh) present a method for using complementary priors to simplify the computation of posterior distributions in deep belief networks. Based on this, they are able to construct a fast greedy algorithm to learn weights in deep belief networks, one layer at a time. These weights may be improved using a contrastive version of the wake-sleep algorithm. The result is an efficient way to train a deep belief network with substantial accuracy, as is shown by top-notch scores in standard classification tasks such as MNIST digit recognition.<br />
<br />
The following figure shows the network used to model the joint distribution<br />
of digit images and digit labels<br />
<br />
[[File:Q1.png]]<br />
<br />
In this paper, each training<br />
case consists of an image and an explicit class label, but work<br />
in progress has shown that the same learning algorithm can<br />
be used if the labels are replaced by a multilayer pathway<br />
whose inputs are spectrograms from multiple different speakers<br />
saying isolated digits. The network then learns to generate<br />
pairs that consist of an image and a spectrogram of the same<br />
digit class.<br />
<br />
== Complementary priors ==<br />
<br />
One obstacle that has hindered inference in directed belief nets is the "explaining away" phenomenon: it is extremely difficult, in general, to compute the posterior distribution over hidden variables in a dense directed belief network. <br />
<br />
The authors describe a way to cancel out the explaining away phenomenon in a hidden layer by using additional hidden layers to create what they refer to as "complementary priors". The idea behind this is that in a logistic belief net (a network where the probability of turning on a unit is a logistic function of the weighted states of its immediate ancestors), if there is only one hidden layer, then the posterior distribution is independent because it is created by the likelihood term coming from the data. Thus the complementary priors can be set so that they precisely make the posterior distribution factorial, and simplify the computation of posterior distributions. <br />
<br />
== A fast, greedy learning algorithm ==<br />
<br />
The main contribution of this paper is a fast greedy algorithm that can learn weights for a deep belief network. The idea of the algorithm is to construct multi-layer directed networks, one layer at a time. As each new layer is added, the overall generative model improves. The essence of the algorithm is similar to the concept of boosting, where the same weak learner is repeatedly used, but with different weighting on the data vector each time; however, in this case, it is the representation of the data vector that changes each time the weak learner is used. Here, the weak learner is an undirected graphical model. <br />
<br />
The figure below shows a hybrid network where the top two layers have undirected connections and the layers below have directed connections in both directions.<br />
:[[File:DeepBeliefNet_derivation.jpg]]<br />
<br />
In the above diagram, the weight matrix <math>W_0\,</math> can be learned, to some level of accuracy, by assuming that all weight matrices are equal and treating the entire network as a Restricted Boltzmann Machine (RBM). Once <math>W_0\,</math> is learned, <math>W_0^T\,</math> can be used to map the data to a higher level in the first hidden layer, and a similar process can be repeated.<br />
<br />
In each stage, the higher level weight matrices would have to be modified. The following greedy algorithm is proposed:<br />
<br />
# Learn <math>W_0\,</math> assuming all the weight matrices are tied.<br />
# Freeze <math>W_0\,</math> and use <math>W_0^T\,</math> to infer factorial approximate posterior distributions over the states of the variables in the first hidden layer. Do this even though subsequent changes in the higher level weights mean that the inference is no longer always correct.<br />
# Keep all higher weight matrices tied to each other, but untie them from <math>W_0\,</math>. In this setting, learn an RBM for the higher level states, using results of the data having <math>W_0\,^T</math> applied as a transformation. <br />
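The three steps above can be sketched in a few lines of NumPy. This is a minimal toy illustration, not the paper's implementation: CD-1 updates, sigmoid units without bias terms, and the tiny layer sizes are all simplifying assumptions here.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Learn one RBM layer with contrastive divergence (CD-1, no biases)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W)                                # up-pass: P(h|v)
        h_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h_sample @ W.T)                        # reconstruction
        h1 = sigmoid(v1 @ W)
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)       # CD-1 update
    return W

def greedy_pretrain(data, layer_sizes):
    """Steps 1-3 of the greedy algorithm: learn W_0, freeze it, map the data
    up through the learned weights, and recurse on the next layer."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)   # inferred states become the next RBM's "data"
    return weights

data = rng.random((64, 20))
ws = greedy_pretrain(data, [15, 10])   # two stacked layers: 20->15->10
```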
<br />
The authors of the paper are able to show that if this greedy algorithm is used to change higher-level weight matrices, then the generative model is guaranteed to improve: the negative log probability of a single data vector, <math>v_0^T\,</math>, under the multilayer generative model is bounded by a variational free energy, namely the expected energy under the approximating distribution <math>Q(h_0^T\,|v_0^T\,)</math> minus the entropy of that distribution.<br />
<br />
== The up-down algorithm ==<br />
<br />
The greedy learning algorithm is an effective and (relatively) rapid way to learn the weights in the deep belief network, but it does not necessarily produce high-quality weights. To obtain better weights, the "up-down" method is suggested; this is a contrastive version of the "wake-sleep" method proposed in a 1995 paper, without some of that method's drawbacks. <br />
<br />
The idea is that, after weights have been learned in such a way that the posterior in each layer must be approximated with a factorial distribution given the values of the preceding layer, the upward "recognition" weights are untied from the downward "generative" weights. Then, higher-level weights can be used to influence lower-level ones. <br />
<br />
Each "up-pass" consists of using the recognition weights to stochastically pick states for each hidden variable, and then adjusting the generative weights using the following maximum likelihood learning rule: <br />
<br />
:<math>\frac{\partial \log p(v^0)}{\partial w_{ij}^{00}} = \langle h_j^0(v_i^0 - \hat{v_i^0})\rangle </math><br />
<br />
The "down-pass" is similar, in that it iterates through layers and adjusts weight matrices, although the iteration begins at the top layers and propagates along the top-down generative connections, and it is the bottom-up recognition weights that are modified. There is some discussion on how the performance may be superior to the similar "wake-sleep" process, if the implementation carries a particular "contrastive" quality.<br />
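A minimal sketch of the up-pass update under the rule above. The single recognition/generative weight pair, batch form, learning rate, and dimensions are all illustrative assumptions, not the paper's multi-layer implementation:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(1)

def up_pass_update(v0, W_rec, W_gen, lr=0.1):
    """Pick stochastic hidden states with the recognition weights, then move
    the generative weights toward reconstructing v0 from those states."""
    h_prob = sigmoid(v0 @ W_rec)
    h0 = (rng.random(h_prob.shape) < h_prob).astype(float)  # stochastic states
    v_hat = sigmoid(h0 @ W_gen)                             # generative prediction
    W_gen = W_gen + lr * h0.T @ (v0 - v_hat) / len(v0)      # <h_j (v_i - v^_i)>
    return W_gen

v0 = (rng.random((32, 8)) < 0.5).astype(float)   # toy binary data batch
W_rec = rng.standard_normal((8, 6))              # fixed recognition weights
W_gen = 0.01 * rng.standard_normal((6, 8))
for _ in range(200):
    W_gen = up_pass_update(v0, W_rec, W_gen)
```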
<br />
<br />
== Performance on MNIST ==<br />
<br />
The training method was applied on a deep belief net consisting of three hidden layers and approximately 1.7 million weights in an experiment to classify MNIST data on handwritten digits. On a basic version of the standard classification task, in which no geometric information is considered and no background knowledge or understanding is taken into account, the generalized performance of the network was approximately 1.25% error on the standard test set.<br />
<br />
The network was trained<br />
on 44,000 of the training images that were divided into 440<br />
balanced mini-batches each containing 10 examples of each<br />
digit class. The weights were updated after each mini-batch. <br />
In the initial phase of training, the greedy algorithm was used to train each layer of weights<br />
separately, starting at the bottom. Each layer was trained for<br />
30 sweeps through the training set (called “epochs”). During<br />
training, the units in the “visible” layer of each RBM had<br />
real-valued activities between 0 and 1. These were the normalized<br />
pixel intensities when learning the bottom layer of<br />
weights. For training higher layers of weights, the real-valued<br />
activities of the visible units in the RBM were the activation<br />
probabilities of the hidden units in the lower-level RBM. The<br />
hidden layer of each RBM used stochastic binary values when<br />
that RBM was being trained. The greedy training took a few<br />
hours per layer in Matlab on a 3GHz Xeon processor and<br />
when it was done, the error-rate on the test set was 2.49%.<br />
<br />
When training the top layer of weights (the ones in the<br />
associative memory) the labels were provided as part of the<br />
input. The labels were represented by turning on one unit in a<br />
“softmax” group of 10 units. When the activities in this group<br />
were reconstructed from the activities in the layer above, exactly<br />
one unit was allowed to be active and the probability of<br />
picking unit i was given by:<br />
<br />
[[File:Q2.png]]<br />
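In code, the reconstruction rule for the label group looks roughly as follows (variable names are illustrative, not from the paper; the key points are the softmax probabilities and the constraint that exactly one unit is active):

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - np.max(x))      # shift by the max for numerical stability
    return e / e.sum()

def reconstruct_label(top_down_input):
    """Sample exactly one active unit in the 10-way softmax label group."""
    p = softmax(top_down_input)
    label = np.zeros_like(p)
    label[rng.choice(len(p), p=p)] = 1.0   # unit i picked with probability p_i
    return label

x = rng.standard_normal(10)    # illustrative top-down inputs to the label units
y = reconstruct_label(x)
```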
<br />
After the greedy layer-by-layer training, the network was<br />
trained, with a different learning rate and weight-decay, for<br />
300 epochs using the up-down algorithm described in section<br />
5. The learning rate, momentum, and weight-decay were<br />
chosen by training the network several times and observing<br />
its performance on a separate validation set of 10,000 images<br />
that were taken from the remainder of the full training<br />
set. For the first 100 epochs of the up-down algorithm, the<br />
up-pass was followed by three full iterations of alternating<br />
Gibbs sampling in the associative memory before performing<br />
the down-pass. For the second 100 epochs, six iterations<br />
were performed, and for the last 100 epochs, ten iterations<br />
were performed. Each time the number of iterations of Gibbs<br />
sampling was raised, the error on the validation set decreased<br />
noticeably.<br />
The network that performed best on the validation set was<br />
then tested and had an error rate of 1.39%. This network was<br />
then trained on all 60,000 training images until its error-rate<br />
on the full training set was as low as its final error-rate had<br />
been on the initial training set of 44,000 images. This took<br />
a further 59 epochs making the total learning time about a<br />
week. The final network had an error-rate of 1.25%. The only standard machine learning technique that comes close to this error rate on the basic task is a support vector machine, which gives an error rate of 1.4%.<br />
<br />
== Conclusion ==<br />
<br />
This paper has shown that it is possible to learn a deep, densely connected,<br />
belief network one layer at a time. The obvious<br />
way to do this is to assume that the higher layers do not exist<br />
when learning the lower layers, but this is not compatible<br />
with the use of simple factorial approximations to replace the<br />
intractable posterior distribution.<br />
<br />
This technique can also be viewed as constrained<br />
variational learning because a penalty term – the divergence<br />
between the approximate and true posteriors – has been replaced<br />
by the constraint that the prior must make the variational<br />
approximation exact.<br />
After each layer has been learned, its weights are untied<br />
from the weights in higher layers. As these higher-level<br />
weights change, the priors for lower layers cease to be complementary, so the true posterior distributions in lower layers<br />
are no longer factorial and the use of the transpose of the generative<br />
weights for inference is no longer correct.<br />
<br />
Some of the major advantages<br />
of generative models as compared to discriminative<br />
ones are:<br />
<br />
1. Generative models can learn low-level features without<br />
requiring feedback from the label and they can<br />
learn many more parameters than discriminative models<br />
without overfitting. In discriminative learning, each<br />
training case only constrains the parameters by as many<br />
bits of information as are required to specify the label.<br />
For a generative model, each training case constrains<br />
the parameters by the number of bits required to specify<br />
the input.<br />
2. It is easy to see what the network has learned by generating<br />
from its model.<br />
3. It is possible to interpret the non-linear, distributed representations<br />
in the deep hidden layers by generating images<br />
from them.<br />
4. The superior classification performance of discriminative<br />
learning methods only holds for domains in which<br />
it is not possible to learn a good generative model. This<br />
set of domains is being eroded by Moore’s law</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Distributed_Representations_of_Words_and_Phrases_and_their_Compositionality&diff=27743Distributed Representations of Words and Phrases and their Compositionality2017-08-30T13:46:33Z<p>Conversion script: Conversion script moved page Distributed Representations of Words and Phrases and their Compositionality to distributed Representations of Words and Phrases and their Compositionality: Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[distributed Representations of Words and Phrases and their Compositionality]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_loss_surfaces_of_multilayer_networks_(Choromanska_et_al.)&diff=27745The loss surfaces of multilayer networks (Choromanska et al.)2017-08-30T13:46:33Z<p>Conversion script: Conversion script moved page The loss surfaces of multilayer networks (Choromanska et al.) to the loss surfaces of multilayer networks (Choromanska et al.): Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[the loss surfaces of multilayer networks (Choromanska et al.)]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_fast_learning_algorithm_for_deep_belief_nets&diff=27747A fast learning algorithm for deep belief nets2017-08-30T13:46:33Z<p>Conversion script: Conversion script moved page A fast learning algorithm for deep belief nets to a fast learning algorithm for deep belief nets: Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[a fast learning algorithm for deep belief nets]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_Number_of_Linear_Regions_of_Deep_Neural_Networks&diff=27736on the Number of Linear Regions of Deep Neural Networks2017-08-30T13:46:32Z<p>Conversion script: Conversion script moved page On the Number of Linear Regions of Deep Neural Networks to on the Number of Linear Regions of Deep Neural Networks: Converting page titles to lowercase</p>
<hr />
<div>= Introduction =<br />
<br />
The paper seeks to answer why deep neural networks perform so much better than shallow neural networks. It is not obvious that they should perform any better: Funahashi (1989) showed that a neural network with just one hidden layer is a universal function approximator (given sufficiently many neurons), so the class of functions a deep neural network can approximate cannot be larger. Furthermore, having many layers can theoretically cause problems during training due to vanishing gradients.<br />
<br />
As both shallow and deep neural networks can approximate the same class of functions, another method of comparison is needed. For this we have to consider what neural networks do: they split the input space into piecewise linear regions. Deep neural networks produce more such regions (with the same number of neurons), which allows them to build a more complex function approximation. Essentially, after the first layer partitions the original input space piecewise linearly, each subsequent layer recognizes pieces of the original input, so that the composition of these layers identifies an exponential number of input regions. This is caused by the deep hierarchy, which allows the same computation to be applied across different regions of the input space.<br />
<br />
[[File:montifar1.png]]<br />
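This effect can be probed empirically with a small sketch (not from the paper): each distinct on/off pattern of the rectifier units corresponds to one linear region, so counting distinct patterns over a grid of inputs lower-bounds the number of regions a network cuts the sampled square into.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_patterns(layer_widths, points, n_in=2):
    """Count distinct ReLU activation patterns of a random net on the points."""
    layers = []
    for width in layer_widths:
        layers.append((rng.standard_normal((n_in, width)),
                       rng.standard_normal(width)))
        n_in = width
    patterns = set()
    for x in points:
        h, pattern = x, []
        for W, b in layers:
            h = np.maximum(0.0, h @ W + b)
            pattern.extend(h > 0)          # on/off state of every hidden unit
        patterns.add(tuple(pattern))
    return len(patterns)

grid = np.array([[a, b] for a in np.linspace(-2, 2, 50)
                 for b in np.linspace(-2, 2, 50)])
shallow = count_patterns([8], grid)     # one layer of 8 units
deep = count_patterns([4, 4], grid)     # two layers of 4 units each
```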
<br />
= Shallow Neural Networks =<br />
<br />
First, an upper limit of regions a shallow neural network produces is derived. This gives not only a measure of the approximation complexity possible with a shallow neural network, but will also be used to obtain the number of regions for deep neural networks.<br />
<br />
The hidden layer of a shallow neural network with <math>n_0</math> inputs and <math>n_1</math> hidden units essentially computes <math>\mathbf{x} \mapsto g(\mathbf{W}\mathbf{x} + \mathbf{b})</math> with input <math>\mathbf{x}</math>, weight matrix <math>\mathbf{W}</math>, bias vector <math>\mathbf{b}</math>, and non-linearity <math>\, g</math>. If <math>g</math> has its non-linearity or an inflection at 0, this gives a distinguished behavior along <math>\mathbf{W}\mathbf{x} + \mathbf{b} = 0</math>, which can act as a decision boundary and represents a hyperplane.<br />
<br />
Let us consider the set <math>H_i := \{\mathbf{x} \in \mathbb{R}^{n_0}: \mathbf{W}_{i,:}\mathbf{x} + \mathbf{b}_i = 0\}</math> of all those hyperplanes (<math>i \in [n_1]</math>). This set splits the input space into several regions (formally defined as connected components of <math>\mathbb{R}^{n_0} \setminus (\cup_i H_i)</math>).<br />
<br />
[[File:hyperplanes.png]]<br />
<br />
With <math>n_1</math> hyperplanes (in general alignment) there will be at most <math>\sum_{j=0}^{n_0} \binom{n_1}{j}</math> regions.<br />
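This bound (the classical counting formula for hyperplane arrangements in general position) is easy to evaluate directly:

```python
from math import comb

# Maximal number of regions that n1 hyperplanes in general position cut
# R^{n0} into, as in the bound above.
def max_regions(n0, n1):
    return sum(comb(n1, j) for j in range(n0 + 1))

assert max_regions(2, 3) == 7    # three lines in general position: 7 regions
assert max_regions(1, 5) == 6    # five points split the line into 6 intervals
```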
<br />
= Deep Neural Networks =<br />
<br />
A hidden layer <math>l</math> of a deep neural network computes a function <math>h_l</math> which maps a set <math>S_{l-1} \subseteq \mathbb{R}^{n_{l-1}}</math> to another set <math>S_{l} \subseteq \mathbb{R}^{n_l}</math>. In this mapping there might be subsets <math>\bar{R}_1, \dots, \bar{R}_k \subseteq S_{l-1}</math> that get mapped onto the same subset <math>R \subseteq S_l</math>, i.e. <math>h_l(\bar{R}_1) = \cdots = h_l(\bar{R}_k) = R</math>. The set of all these subsets is denoted with <math>P_R^l</math>.<br />
<br />
[[File:sets.png]]<br />
<br />
This allows to define the number of separate input-space neighbourhoods mapped onto a common neighbourhood <math>R</math>. For each subset <math>\bar{R}_i</math> that maps to <math>R</math> we have to add up the number of subsets mapping to <math>\bar{R}_i</math> giving the recursive formula <math>\mathcal{N}_R^l = \sum_{R' \in P_R^l} \mathcal{N}_{R'}^{l-1}</math> with <math>\mathcal{N}_R^0 = 1</math> for each region <math>R \subseteq \mathbb{R}^{n_0}</math> in the input space. Applying this formula for each distinct linear region computed by the last hidden layer, a set denoted with <math>P^L</math>, we get the maximal number of linear regions of the functions computed by an <math>L</math>-layer neural network with piecewise linear activations as <math>\mathcal{N} = \sum_{R \in P^L} \mathcal{N}_R^{L-1} \text{.}</math><br />
<br />
= Space Folding =<br />
<br />
An intuition of the process of mapping input-space neighbourhoods to common neighbourhoods can be given in terms of space folding. Each such mapping can be seen as folding the input space so that the input-space neighbourhoods are overlayed. Thus, each hidden layer of a deep neural network can be associated with a folding operator and any function computed on the final folded space will be applied to all regions successively folded onto each other. Note that the foldings are encoded in the weight matrix <math>\mathbf{W}</math>, bias vector <math>\mathbf{b}</math> and activation function <math>g</math>. This allows for foldings separate from the coordinate axes and non-length preserving foldings.<br />
<br />
[[File:montifar2.png]]<br />
[[File:montifar3.png]]<br />
<br />
= Deep Rectifier Networks =<br />
<br />
To obtain a lower bound on the maximal number of linear regions computable by a deep rectifier network, a network is constructed in such a way that the number of linear regions mapped onto each other is maximized. Each of the <math>n</math> rectifier units in a layer processes only one of the <math>n_0</math> inputs. This partitions the rectifier units into <math>n_0</math> subsets, each of cardinality <math>p = \lfloor n/n_0 \rfloor</math> (ignoring remaining units). For subset <math>j</math> we select the <math>j</math>-th input with a row vector <math>\mathbf{w}</math> whose <math>j</math>-th entry is 1 and whose remaining entries are 0. The bias values are absorbed into the activation functions of the <math>p</math> units:<br />
<br />
:<math>h_1(\mathbf{x}) = \max \{ 0, \mathbf{w}\mathbf{x} \}</math><br />
:<math>h_i(\mathbf{x}) = \max \{ 0, 2\mathbf{w}\mathbf{x} - 2(i - 1) \}, \quad 1 < i \leq p</math><br />
<br />
Next, these activation functions are added with alternating signs; this calculation can be absorbed into the connection weights to the next layer:<br />
<br />
:<math>\tilde{h}_j(\mathbf{x}) = h_1(\mathbf{x}) - h_2(\mathbf{x}) + h_3(\mathbf{x}) - \cdots + {(-1)}^{p-1} h_p(\mathbf{x})</math><br />
<br />
This gives a function which folds the <math>p</math> segments <math>(-\infty, 0],\ [0, 1],\ [1, 2],\ \ldots,\ [p - 1, \infty)</math> onto the interval <math>(0, 1)</math>.<br />
<br />
[[File:constr.png]]<br />
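The folding can be checked numerically in one dimension, taking <math>w = 1</math> and placing the rectifier breakpoints at the integers so that the alternating sum folds each of the listed segments onto the unit interval:

```python
# One-dimensional sketch of the folding construction (w = 1; breakpoints at
# the integers 1, ..., p-1, matching the segments listed above).
def fold(x, p):
    hs = [max(0.0, x)]
    hs += [max(0.0, 2.0 * x - 2.0 * (i - 1)) for i in range(2, p + 1)]
    # alternating sum h1 - h2 + h3 - ...
    return sum(h if i % 2 == 0 else -h for i, h in enumerate(hs))

# 0.5, 1.5 and 2.5 lie in three different segments yet fold onto the same point:
assert fold(0.5, 3) == fold(1.5, 3) == fold(2.5, 3) == 0.5
```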
<br />
Going from these <math>n_0</math> functions for subsets of rectifiers to the full <math>n_0</math>-dimensional function <math>\tilde{h} = {[\tilde{h}_1, \tilde{h}_2, \ldots, \tilde{h}_{n_0}]}^{\top}</math> gives a total of <math>p^{n_0}</math> hypercubes mapped onto the same output.<br />
<br />
Counting the number of separate regions produced by the last layer and multiplying by the number of regions that map onto each of them, we get<br />
<br />
:<math>\underbrace{\left( \prod_{i=1}^{L-1} {\left\lfloor \frac{n_i}{n_0} \right\rfloor}^{n_0} \right)}_{\text{mapped hypercubes}} \cdot \underbrace{\sum_{j=0}^{n_0} \binom{n_L}{j}}_{\text{last layer (shallow net)}}</math><br />
<br />
as the lower bound on the maximal number of linear regions of functions computed by a deep rectifier network with <math>n_0</math> inputs and <math>L</math> hidden layers. We can also write this lower bound as <math>\Omega\!\left({\left(\frac{n}{n_0}\right)}^{(L-1)n_0} n^{n_0}\right)</math>, which makes it clear that the number of regions grows exponentially with <math>L</math>, versus the polynomial scaling of a shallow model with <math>nL</math> hidden units.<br />
<br />
In fact, it is possible to obtain asymptotic bounds on the number of linear regions per parameter in the neural network models:<br />
<br />
* For a deep model, the asymptotic bound is exponential: <math>\Omega\left(\left(n/n_0\right)^{n_0(L-1)}\frac{n^{n_0-2}}{L}\right)</math><br />
* For a shallow model, the asymptotic bound is polynomial: <math>O(L^{n_0-1}n^{n_0-1})</math><br />
<br />
= Conclusion =<br />
<br />
The number of piecewise linear segments the input space can be split into grows exponentially with the number of layers of a deep neural network, whereas the growth is only polynomial with the number of neurons. This explains why deep neural networks perform so much better than shallow neural networks. The paper showed this result for deep rectifier networks and deep maxout networks, but the same analysis should be applicable to other types of deep neural networks.<br />
<br />
Furthermore, the paper provides a useful intuition in terms of space folding to think about deep neural networks.</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=the_Manifold_Tangent_Classifier&diff=27738the Manifold Tangent Classifier2017-08-30T13:46:32Z<p>Conversion script: Conversion script moved page The Manifold Tangent Classifier to the Manifold Tangent Classifier: Converting page titles to lowercase</p>
<hr />
<div>== Introduction ==<br />
<br />
The goal in many machine learning problems is to extract information from data with minimal prior knowledge<ref name = "main"> Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., & Muller, X. (2011). [http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2011_1240.pdf The manifold tangent classifier.] In Advances in Neural Information Processing Systems (pp. 2294-2302). </ref> These algorithms are designed to work on numerous problems which they may not be specifically tailored towards, thus domain-specific knowledge is generally not incorporated into the models. However, some generic "prior" hypotheses are considered to aid in the general task of learning, and three very common ones are presented below:<br />
<br />
# The '''semi-supervised learning hypothesis''': This states that knowledge of the input distribution <math>p\left(x\right)</math> can aid in learning the output distribution <math>p\left(y|x\right)</math> .<ref>Lasserre, J., Bishop, C. M., & Minka, T. P. (2006, June). [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1640745 Principled hybrids of generative and discriminative models.] In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (Vol. 1, pp. 87-94). IEEE.</ref> This hypothesis lends credence to not only the theory of strict semi-supervised learning, but also unsupervised pretraining as a method of feature extraction.<ref> Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). [http://www.mitpressjournals.org/doi/pdf/10.1162/neco.2006.18.7.1527 A fast learning algorithm for deep belief nets.] Neural computation, 18(7), 1527-1554.</ref><ref>Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. (2010). [http://delivery.acm.org/10.1145/1760000/1756025/p625-erhan.pdf?ip=129.97.89.222&id=1756025&acc=PUBLIC&key=FD0067F557510FFB%2E9219CF56F73DCF78%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=561475515&CFTOKEN=96787671&__acm__=1447710319_1ea806f74c2b3b6959e97d9d0e03d533 Why does unsupervised pre-training help deep learning?.] The Journal of Machine Learning Research, 11, 625-660.</ref><br />
# The '''unsupervised manifold hypothesis''': This states that real-world data presented in high-dimensional spaces is likely to concentrate around a low-dimensional sub-manifold.<ref>Cayton, L. (2005). [http://www.vis.lbl.gov/~romano/mlgroup/papers/manifold-learning.pdf Algorithms for manifold learning.] Univ. of California at San Diego Tech. Rep, 1-17.</ref><br />
# The '''manifold hypothesis for classification''': This states that points of different classes are likely to concentrate along different sub-manifolds, separated by low-density regions of the input space.<ref name = "main"></ref><br />
<br />
The recently-proposed Contractive Auto-Encoder (CAE) algorithm has shown success in the task of unsupervised feature extraction,<ref name = "CAE">Rifai, S., Vincent, P., Muller, X., Glorot, X., & Bengio, Y. (2011). [http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf Contractive auto-encoders: Explicit invariance during feature extraction.] In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 833-840).</ref> with its successful application in pre-training of Deep Neural Networks (DNN) an illustration of the merits of adopting '''Hypothesis 1'''. CAE also yields a mostly contractive mapping that is locally only sensitive to a few input directions, which implies that it models a lower-dimensional manifold (exploiting '''Hypothesis 2''') since the directions of sensitivity are in the tangent space of the manifold. <br />
<br />
This paper furthers the previous work by using the information about the tangent spaces by considering '''Hypothesis 3''': it extracts basis vectors for the local tangent space around each training point from the parameters of the CAE. Then, older supervised classification algorithms that exploit tangent directions as domain-specific prior knowledge can be used on the tangent spaces generated by CAE for fine-tuning the overall classification network. This approach seamlessly integrates all three of the above hypotheses and produces record-breaking results (for 2011) on image classification.<br />
<br />
== Contractive Auto-Encoders (CAE) and Tangent Classification ==<br />
<br />
The problem is to find a non-linear feature extractor for a dataset <math>\mathcal{D} = \{x_1, \ldots, x_n\}</math>, where <math>x_i \in \mathbb{R}^d</math> are i.i.d. samples from an unknown distribution <math> p\left(x\right)</math>.<br />
<br />
=== Traditional Auto-Encoders === <br />
<br />
A traditional auto-encoder learns an '''encoder''' function <math>h: \mathbb{R}^d \rightarrow \mathbb{R}^{d_h}</math> along with a '''decoder''' function <math>g: \mathbb{R}^{d_h} \rightarrow \mathbb{R}^{d}</math>, represented as <math>r = g\left(h\left(x\right)\right) </math>. <math>h\,</math> maps input <math>x\,</math> to the hidden representation space, and <math>g\,</math> reconstructs <math>x\,</math> from that representation. Letting <math>L\left(x,g\left(h\left(x\right)\right)\right)</math> denote the reconstruction error, the objective function minimized to learn the parameters <math>\theta\,</math> of the encoder/decoder is as follows:<br />
<br />
:<math> \mathcal{J}_{AE}\left(\theta\right) = \sum_{x\in\mathcal{D}}L\left(x,g\left(h\left(x\right)\right)\right) </math><br />
<br />
The form of the '''encoder''' is <math>h\left(x\right) = s\left(Wx + b_h\right)</math>, where <math>s\left(z\right) = \frac{1}{1 + e^{-z}}</math> is the element-wise logistic sigmoid. <math>W \in \mathbb{R}^{d_h \times d} </math> and <math>b_h \in \mathbb{R}^{d_h}</math> are the parameters (weight matrix and bias vector, respectively). The form of the '''decoder''' is <math>r = g\left(h\left(x\right)\right) = s_2\left(W^Th\left(x\right)+b_r\right)</math>, where <math>\,s_2 = s</math> or the identity. The weight matrix <math>W^T\,</math> is shared with the encoder, with the only new parameter being the bias vector <math>b_r \in \mathbb{R}^d</math>.<br />
<br />
The '''loss function''' can either be the squared error <math>L\left(x,r\right) = \|x - r\|^2</math> or the Bernoulli cross-entropy, given by: <br />
<br />
:<math> L\left(x, r\right) = -\sum_{i=1}^d \left[x_i \mbox{log}\left(r_i\right) + \left(1 - x_i\right)\mbox{log}\left(1 - r_i\right)\right]</math><br />
<br />
=== First- and Higher-Order Contractive Auto-Encoders ===<br />
<br />
==== Additional Penalty on Jacobian ==== <br />
<br />
The Contractive Auto-Encoder (CAE), proposed by Rifai et al.<ref name = "CAE"></ref>, encourages robustness of <math>h\left(x\right)</math> to small variations in <math>x</math> by penalizing the Frobenius norm of the encoder's Jacobian <math>J\left(x\right) = \frac{\partial h}{\partial x}\left(x\right)</math>. The new objective function to be minimized is:<br />
<br />
:<math> \mathcal{J}_{CAE}\left(\theta\right) = \sum_{x\in\mathcal{D}}L\left(x,g\left(h\left(x\right)\right)\right) + \lambda\|J\left(x\right)\|_F^2 </math><br />
<br />
where <math>\lambda</math> is a non-negative regularization parameter. We can compute the <math>j^{th}</math> row of the Jacobian of the sigmoidal encoder quite easily using the <math>j^{th}</math> row of <math>W</math>:<br />
<br />
:<math> J\left(x\right)_j = \frac{\partial h_j\left(x\right)}{\partial x} = h_j\left(x\right)\left(1 - h_j\left(x\right)\right)W_j</math><br />
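Because of this closed form, the penalty <math>\|J\left(x\right)\|_F^2</math> costs little more than a forward pass. A small NumPy sketch (dimensions and parameter values here are arbitrary, not from the paper) verifies the analytic Jacobian against finite differences:

```python
import numpy as np

# Closed-form Jacobian of the sigmoid encoder: row j is h_j (1 - h_j) W_j,
# so the CAE penalty ||J(x)||_F^2 is cheap to evaluate.
rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d, d_h = 5, 3                         # illustrative input / hidden sizes
W = rng.standard_normal((d_h, d))
b_h = rng.standard_normal(d_h)

def encoder(x):
    return sigmoid(W @ x + b_h)

def jacobian(x):
    h = encoder(x)
    return (h * (1.0 - h))[:, None] * W      # row j: h_j (1 - h_j) W_j

x = rng.standard_normal(d)
penalty = np.sum(jacobian(x) ** 2)           # ||J(x)||_F^2

# Check the closed form against central finite differences:
eps = 1e-6
J_num = np.stack([(encoder(x + eps * np.eye(d)[i]) -
                   encoder(x - eps * np.eye(d)[i])) / (2 * eps)
                  for i in range(d)], axis=1)
assert np.allclose(jacobian(x), J_num, atol=1e-5)
```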
<br />
==== Additional Penalty on Hessian ====<br />
<br />
It is also possible to penalize higher-order derivatives by approximating the Hessian (explicit computation of the Hessian is costly). It is sufficient to penalize the difference between <math>J\left(x\right)</math> and <math>J\left(x + \varepsilon\right)</math> where <math>\,\varepsilon </math> is small, as this represents the rate of change of the Jacobian. This yields the "CAE+H" variant, with objective function as follows:<br />
<br />
:<math> \mathcal{J}_{CAE+H}\left(\theta\right) = \mathcal{J}_{CAE}\left(\theta\right) + \gamma\sum_{x \in \mathcal{D}}\mathbb{E}_{\varepsilon\sim\mathcal{N}\left(0,\sigma^2I\right)} \left[\|J\left(x\right) - J\left(x + \varepsilon\right)\|^2\right] </math><br />
<br />
The expectation above, in practice, is taken over stochastic samples of the noise variable <math>\varepsilon\,</math> at each stochastic gradient descent step. <math>\gamma\,</math> is another regularization parameter. This formulation will be the one used within this paper.<br />
<br />
=== Characterizing the Tangent Bundle Captured by a CAE ===<br />
<br />
Although the regularization term encourages insensitivity of <math>h(x)</math> in all input space directions, the pressure to form an accurate reconstruction counters this somewhat, and the result is that <math>h(x)</math> is only sensitive to a few input directions necessary to distinguish close-by training points.<ref name = "CAE"></ref> Geometrically, the interpretation is that these directions span the local tangent space of the underlying manifold that characterizes the input data. <br />
<br />
==== Geometric Terms ====<br />
<br />
* '''Tangent Bundle''': The tangent bundle of a smooth manifold is the manifold along with the set of tangent planes taken at all points in it.<br />
* '''Chart''': A local Euclidean coordinate system equipped to a tangent plane. Each tangent plane has its own chart.<br />
* '''Atlas''': A collection of local charts.<br />
<br />
==== Conditions for Feature Mapping to Define an Atlas on a Manifold ====<br />
<br />
To obtain a proper atlas of charts, <math>h</math> must be a local diffeomorphism (locally smooth and invertible). Since the sigmoidal mapping is smooth, <math>\,h</math> is guaranteed to be smooth. To determine injectivity of <math>h\,</math>, consider the following, <math>\forall x_i, x_j \in \mathcal{D}</math>:<br />
<br />
:<math><br />
\begin{align}<br />
h(x_i) = h(x_j) &\Leftrightarrow s\left(Wx_i + b_h\right) = s\left(Wx_j + b_h\right) \\<br />
& \Leftrightarrow Wx_i + b_h = Wx_j + b_h \mbox{, since } s \mbox{ is invertible} \\<br />
& \Leftrightarrow W\Delta_{ij} = 0 \mbox{, where } \Delta_{ij} = x_i - x_j<br />
\end{align}<br />
</math><br />
<br />
Thus, as long as every difference <math>\Delta_{ij}</math> lies in the span of the rows <math>W_k\,</math> of <math>W\,</math>, i.e. <math>\forall i,j \,\,\exists \alpha \in \mathbb{R}^{d_h} | \Delta_{ij} = \sum_{k=1}^{d_h}\alpha_k W_k</math>, the injectivity of <math>h\left(x\right)</math> on the training set is preserved (the condition <math>W\Delta_{ij} = 0\,</math> would then imply <math>\Delta_{ij} = 0\,</math> above). Furthermore, if we restrict the codomain of <math>\,h</math> to <math>h\left(\mathcal{D}\right) \subset \left(0,1\right)^{d_h}</math>, containing only the values obtainable by <math>h\,</math> applied to the training set <math>\mathcal{D}</math>, then <math>\,h</math> is surjective by definition. Therefore, <math>\,h</math> is bijective between <math>\mathcal{D}</math> and <math>h\left(\mathcal{D}\right)</math>, meaning that <math>h\,</math> is a local diffeomorphism around each point in the training set.<br />
<br />
==== Generating an Atlas from a Learned Feature Mapping ====<br />
<br />
We now need to determine how to generate local charts around each <math>x \in \mathcal{D}</math>. Since <math>h</math> must be sensitive to changes between <math>x_i</math> and one of its neighbours <math>x_j</math>, but insensitive to other changes, we expect this to be encoded in the spectrum of the Jacobian <math>J\left(x\right) = \frac{\partial h}{\partial x}\left(x\right)</math>. Thus, we define a local chart around <math>x</math> using the singular value decomposition of <math>\,J^T(x) = U(x)S(x)V^T(x)</math>. The tangent plane <math>\mathcal{H}_x</math> at <math>\,x</math> is then given by the span of the set of principal singular vectors <math>\mathcal{B}_x</math>, as long as the associated singular value is above a given small <math>\varepsilon\,</math>:<br />
<br />
:<math>\mathcal{B}_x = \{U_{:,k}(x) | S_{k,k}(x) > \varepsilon\} \mbox{ and } \mathcal{H}_x = \{x + v | v \in \mbox{span}\left(\mathcal{B}_x\right)\} </math><br />
<br />
where <math>U_{:,k}(x)\,</math> is the <math>k^{th}</math> column of <math>U\left(x\right)</math>. <br />
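The chart-extraction rule above (keep the left singular vectors of <math>J^T(x)</math> whose singular value exceeds <math>\varepsilon</math>) can be sketched as follows; the rank-2 toy Jacobian is purely illustrative:

```python
import numpy as np

def local_chart_basis(J, eps=1e-3):
    """B_x: columns of U from the SVD of J(x)^T whose singular value exceeds eps."""
    U, S, _ = np.linalg.svd(J.T, full_matrices=False)
    return U[:, S > eps]          # columns span the tangent plane at x

# a rank-2 Jacobian should yield a 2-dimensional tangent basis
rng = np.random.default_rng(0)
J = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 6))   # shape (d_h, d)
B = local_chart_basis(J)
```

The returned columns are orthonormal, so they can be used directly as the tangent directions <math>\mathcal{B}_x</math> in the later penalties.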
<br />
Then, we can define an atlas <math>\mathcal{A}</math> captured by <math>h\,</math>, based on the local linear approximation around each example:<br />
<br />
:<math> \mathcal{A} = \{\left(\mathcal{M}_x, \phi_x\right) | x\in\mathcal{D}, \phi_x\left(\tilde{x}\right) = \mathcal{B}_x\left(x - \tilde{x}\right)\}</math><br />
<br />
=== Exploiting Learned Directions for Classification ===<br />
<br />
We would like to use the local charts defined above as additional information for the task of classification. In doing so, we will adopt the '''manifold hypothesis for classification'''.<br />
<br />
==== CAE-Based Tangent Distance ====<br />
<br />
We start by defining the '''tangent distance''' between two points as the distance between their two respective hyperplanes <math>\mathcal{H}_x, \mathcal{H}_y</math> defined above, where the distance between hyperplanes is defined as:<br />
<br />
:<math> d\left(\mathcal{H}_x,\mathcal{H}_y\right) = \mbox{inf}\{\|z - w\|^2\,\, | \left(z,w\right) \in \mathcal{H}_x \times \mathcal{H}_y\}</math><br />
<br />
Finding this distance is a convex problem that reduces to solving a system of linear equations.<ref>Simard, P., LeCun, Y., & Denker, J. S. (1993). [http://papers.nips.cc/paper/656-efficient-pattern-recognition-using-a-new-transformation-distance.pdf Efficient pattern recognition using a new transformation distance.] In Advances in neural information processing systems (pp. 50-58).</ref> Minimizing the distance in this way allows <math>x, y \in \mathcal{D}</math> to move along their associated tangent spaces, and have the distance evaluated where <math>x</math> and <math>y</math> are closest. A nearest-neighbour classifier could then be used based on this distance.<br />
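Since each tangent plane is an affine subspace, the infimum reduces to a linear least-squares problem in the plane coordinates; a sketch, assuming the tangent bases are given as matrices of column vectors:

```python
import numpy as np

def tangent_distance(x, y, Bx, By):
    """Squared distance between the affine planes x + span(Bx) and y + span(By).
    The unknowns are the coordinates [a; b] of the closest pair of points."""
    M = np.hstack([Bx, -By])
    c, *_ = np.linalg.lstsq(M, y - x, rcond=None)   # solve M [a; b] ~= y - x
    a, b = c[:Bx.shape[1]], c[Bx.shape[1]:]
    r = (x + Bx @ a) - (y + By @ b)
    return float(r @ r)

# two parallel horizontal lines at heights 0 and 1: squared distance is 1
x, y = np.array([0.0, 0.0]), np.array([3.0, 1.0])
Bx = By = np.array([[1.0], [0.0]])
```

By construction the tangent distance is never larger than the plain Euclidean distance between <math>x</math> and <math>y</math>.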
<br />
==== CAE-Based Tangent Propagation ====<br />
<br />
Nearest-neighbour techniques work in theory, but are often impractical for large-scale datasets. Classifying test points in this way grows linearly with the number of training points. Neural networks, however, can quickly classify test points once they are trained. We would like the output <math>o</math> of the classifier to be insensitive to variations in the directions of the local chart around <math>x</math>. To this end, we add the following penalty to the objective function of the (supervised) network:<br />
<br />
:<math> \Omega\left(x\right) = \sum_{u \in \mathcal{B}_x} \left|\left| \frac{\partial o}{\partial x}\left(x\right) u \right|\right|^2 </math><br />
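Given the classifier's output Jacobian at <math>x</math> and the stored tangent basis, the penalty is a single matrix product; a minimal sketch with hypothetical shapes:

```python
import numpy as np

def tangent_prop_penalty(J_o, B_x):
    """Omega(x): sum over tangent directions u (columns of B_x) of ||(do/dx) u||^2,
    where J_o is the Jacobian of the classifier output with respect to the input."""
    return float(np.sum((J_o @ B_x) ** 2))

J_o = np.array([[1.0, 0.0, 0.0]])                         # output sensitive only to x_1
B_flat = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # tangents orthogonal to x_1
```

The penalty vanishes exactly when the output is already invariant along every stored tangent direction, which is the behaviour the regularizer encourages.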
<br />
=== The Manifold Tangent Classifier (MTC) ===<br />
<br />
Finally, we are able to put all of the results together into a full algorithm for training a network. The steps follow below:<br />
<br />
# Train (unsupervised) a stack of <math>K\,</math> CAE+H layers as in section 2.2.2. Each layer is trained on the representation learned by the previous layer.<br />
# For each <math>x_i \in \mathcal{D}</math>, compute the Jacobian of the last layer representation <math>J^{(K)}(x_i) = \frac{\partial h^{(K)}}{\partial x}\left(x_i\right)</math> and its SVD. Note that <math>J^{(K)}\,</math> is the product of the Jacobians of each encoder. Store the leading <math>d_M\,</math> singular vectors in <math>\mathcal{B}_{x_i}</math>.<br />
# After the <math>K\,</math> CAE+H layers, add a sigmoidal output layer with a node for each class. Train the entire network for supervised classification, adding in the propagation penalty in 2.4.2. Note that for each <math>x_i, \mathcal{B}_{x_i}</math> contains the set of tangent vectors to use.<br />
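Step 2 relies on the chain rule: the Jacobian of the stacked encoder is the product of the per-layer Jacobians. A sketch with sigmoidal layers, checked against finite differences (layer sizes are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stacked_jacobian(x, layers):
    """J^(K)(x) = product of the per-layer encoder Jacobians (chain rule).
    layers is a list of (W, b) pairs for the sigmoidal CAE+H encoders."""
    J, h = np.eye(x.shape[0]), x
    for W, b in layers:
        h = sigmoid(W @ h + b)
        J = ((h * (1.0 - h))[:, None] * W) @ J   # multiply this layer's Jacobian in
    return J   # shape (d_K, d_input)

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 6)), rng.standard_normal(4)),
          (rng.standard_normal((3, 4)), rng.standard_normal(3))]
x = rng.standard_normal(6)
J = stacked_jacobian(x, layers)
```

The SVD of `J.T` then yields the tangent basis <math>\mathcal{B}_{x}</math> for the deep representation, exactly as in the single-layer case.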
<br />
== Related Work == <br />
<br />
There are a number of existing non-linear manifold learning algorithms (e.g. <ref>[http://web.mit.edu/cocosci/Papers/sci_reprint.pdf A Global Geometric Framework for Nonlinear Dimensionality Reduction] Tenenbaum et al., Science (2000)</ref>) that learn the tangent bundle for a set of training points (i.e. the main directions of variation around each point). One drawback of these existing approaches is that they are typically non-parametric and use local parameters to define the tangent plane around each datapoint. This potentially results in manifold learning algorithms that require training data that grows exponentially with manifold dimension and curvature. <br />
<br />
The semi-supervised embedding algorithm <ref>[http://ronan.collobert.com/pub/matos/2008_deep_icml.pdf Deep learning via semi-supervised embedding] Weston et al., ICML (2008) </ref> is also related in that it encourages the hidden states of a network to be invariant with respect to changes to neighbouring datapoints in the training set. The present work, however, initially aims for representations that are sensitive to such local variations, as explained above. <br />
<br />
== Results ==<br />
<br />
=== Datasets Considered ===<br />
<br />
The MTC was tested on the following datasets:<br />
<br />
*'''MNIST''': Set of 28 by 28 images of handwritten digits, and the goal is to predict the digit contained in the image.<br />
*'''Reuters Corpus Volume I''': Contains 800,000 real-world news stories. Used the 2000 most frequent words calculated on the whole dataset to create a bag-of-words representation.<br />
*'''CIFAR-10''': Dataset of 60,000 32 by 32 RGB real-world images. <br />
*'''Forest Cover Type''': Large-scale database of cartographic variables for prediction of forest cover types.<br />
<br />
=== Method ===<br />
<br />
To investigate the improvements made by CAE-learned tangents, the following method is employed: Optimal hyper-parameters (e.g. <math>\gamma, \lambda\,,</math> etc.) were selected by cross-validation on a validation set disjoint from the training set. The quality of the features extracted by the CAE is evaluated by initializing a standard multi-layer perceptron network with the same parameters as the trained CAE and fine-tuning it by backpropagation on the supervised task.<br />
<br />
=== Visualization of Learned Tangents === <br />
<br />
Figure 1 visualizes the tangents learned by CAE. The example is on the left, and 8 tangents are shown to the right. On the MNIST dataset, the tangents are small geometric transformations. For CIFAR-10, the tangents appear to be parts of the image. For Reuters, the tangents correspond to addition/removal of similar words, with the positive terms in green and the negative terms in red. We see that the tangents do not seem to change the class of the example (e.g. the tangents of the above "0" in MNIST all resemble zeroes).<br />
<br />
[[File:Figure_1_MTC.png|frame|center|Fig. 1: Tangents Extracted by CAE]]<br />
<br />
=== MTC in Semi-Supervised Setting ===<br />
<br />
The MTC method was evaluated on the MNIST dataset in a semi-supervised setting: the unsupervised feature extractor is trained on the full training set, and the supervised classifier is trained only on a restricted labelled set. The results with a single-layer perceptron initialized with CAE+H pretraining (abbreviated CAE), and the same classifier with tangent propagation added (i.e. MTC), are in Table 1. The performance is compared to methods that do not exploit the semi-supervised setting (Support Vector Machines (SVM), Neural Networks (NN), Convolutional Neural Networks (CNN)); those methods perform poorly relative to MTC, especially when labelled data is scarce. <br />
<br />
{| class="wikitable"<br />
|+Table 1: Semi-Supervised classification error on MNIST test set<br />
|-<br />
|'''# Labeled'''<br />
|'''NN'''<br />
|'''SVM'''<br />
|'''CNN'''<br />
|'''CAE'''<br />
|'''MTC'''<br />
|-<br />
|100<br />
|25.81<br />
|23.44<br />
|22.98<br />
|13.47<br />
|'''12.03'''<br />
|-<br />
|600<br />
|11.44<br />
|8.85<br />
|7.68<br />
|6.3<br />
|'''5.13'''<br />
|-<br />
|1000<br />
|10.7<br />
|7.77<br />
|6.45<br />
|4.77<br />
|'''3.64'''<br />
|-<br />
|3000<br />
|6.04<br />
|4.21<br />
|3.35<br />
|3.22<br />
|'''2.57''' <br />
|}<br />
<br />
=== MTC in Full Classification Problems ===<br />
<br />
We consider using MTC to classify using the full MNIST dataset (i.e. the fully supervised problem), and compare with other methods. The CAE used for tangent discovery is a two-layer deep network with 2000 units per-layer pretrained with the CAE+H objective. The MTC uses the same stack of CAEs trained with tangent propagation, using <math>d_M = 15\,</math> tangents. The MTC produces state-of-the-art results, achieving a 0.81% error on the test set (as opposed to the previous state-of-the-art result of 0.95% error, achieved by Deep Boltzmann Machines). Table 2 summarizes this result. Note that MTC also beats out CNN, which utilizes prior knowledge about vision using convolutions and pooling.<br />
<br />
{| class="wikitable"<br />
|+Table 2: Class. error on MNIST Test Set with full Training Set<br />
|-<br />
|K-NN<br />
|NN<br />
|SVM<br />
|CAE<br />
|DBM<br />
|CNN<br />
|MTC<br />
|-<br />
|3.09%<br />
|1.60%<br />
|1.40%<br />
|1.04%<br />
|0.95%<br />
|0.95%<br />
|'''0.81'''%<br />
|}<br />
<br />
A 4-layer MTC was trained on the Forest CoverType dataset. The MTC produces the best performance on this classification task, beating out the previous best method which used a mixture of non-linear SVMs (denoted as distributed SVM).<br />
<br />
{| class="wikitable"<br />
|+Table 3: Class. error on Forest Data<br />
|-<br />
|SVM<br />
|Distributed SVM<br />
|MTC<br />
|-<br />
|4.11%<br />
|3.46%<br />
|'''3.13'''%<br />
|}<br />
<br />
== Conclusion ==<br />
<br />
This paper unifies three common generic prior hypotheses into a single approach. It uses a semi-supervised manifold method to learn local charts around training points in the data, and then uses the tangents generated by these local charts to inform the classifier. The generated tangents appear to be meaningful decompositions of the training examples. When the tangents are combined with the classifier, state-of-the-art results are obtained on classification problems in a variety of domains.<br />
<br />
== Discussion ==<br />
<br />
* I thought about how it could be possible to use an element-wise rectified linear unit <math>R\left(x\right) = \mbox{max}\left(0,x\right)</math> in place of the sigmoidal function for encoding, as this type of functional form has seen success in other deep learning methods. However, I believe that this type of functional form would preclude <math>h</math> from being diffeomorphic, as the <math>x</math>-values that are negative could not possibly be reconstructed. Thus, the sigmoidal form should likely be retained, although it would be interesting to see how other invertible non-linearities would perform (e.g. hyperbolic tangent).<br />
<br />
* It would be interesting to investigate applying the method of tangent extraction to other unsupervised methods, and then create a classifier based on these tangents in the same way that it is done in this paper. Further work could be done to apply this approach to clustering algorithms, kernel PCA, E-M, etc. This is more of a suggestion than a concrete idea, however.<br />
<br />
* It is not exactly clear to me how <math>h</math> could ever define a true diffeomorphism, since <math>h: \mathbb{R}^{d} \mapsto \mathbb{R}^{d_h}</math>, where <math>d \ne d_h</math>, in general. Clearly, if <math>d > d_h</math>, such a map could not be injective. However, the authors may be able to "manufacture" the injectivity of <math>h</math> using the fact that <math>\mathcal{D}</math> is a discrete set of points. I'm not sure that this approach defines a continuous manifold, but I'm also not sure if that really matters in this case.<br />
<br />
== Bibliography ==<br />
<references /></div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Convolutional_Feature_Hierarchies_for_Visual_Recognition&diff=27740learning Convolutional Feature Hierarchies for Visual Recognition2017-08-30T13:46:32Z<p>Conversion script: Conversion script moved page Learning Convolutional Feature Hierarchies for Visual Recognition to learning Convolutional Feature Hierarchies for Visual Recognition: Converting page titles to lowercase</p>
<hr />
<div>=Overview=<br />
<br />
This paper<ref>Kavukcuoglu, K, Sermanet, P, Boureau, Y, Gregor, K, Mathieu, M, and Cun, Y. . Learning convolutional feature hierarchies for visual recognition. In Advances in neural information processing systems, 1090-1098, 2010.</ref> describes methods for learning features extracted through convolutional filter banks; in particular, it gives methods for using sparse coding convolutionally. The paper proposes to improve feature-extraction efficiency by jointly learning a feed-forward encoder with the convolutional filter bank, and applies the algorithm to Convolutional Networks (ConvNets), achieving impressive results on object recognition. Natural images, sounds, and more generally signals that display translation invariance in any dimension are better represented using convolutional dictionaries. Standard sparse coding typically assumes that training image patches are independent from each other, and thus neglects the spatial correlation among them. In sparse coding, the sparse feature vector z is constructed to reconstruct the input x with a dictionary D. The procedure produces a code z* by minimizing the energy function:<br />
<br />
:<math>L(x,z,D) = \frac{1}{2}||x-Dz||_2^2 + |z|_1, \ \ \ z^* = \underset{z}{\operatorname{arg\ min}} \ L(x,z,D)</math><br />
<br />
D is obtained by minimizing the above with respect to D: <math>\underset{z,D}{\operatorname{arg\ min}} \ L(x,z,D)</math>, averaged over the training set. The drawbacks to this method are that the representation is redundant and that the inference for a whole image is computationally expensive. The reason is that the system is trained on single image patches in most applications of sparse coding to image analysis, which produces a dictionary of filters that are essentially shifted versions of each other over the patch and reconstructed in isolation.<br />
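For reference, inference in this patch-based model can be sketched with a proximal-gradient (ISTA) loop; this is a standard substitute for the coordinate descent method the paper builds on, and all dimensions below are illustrative:

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def ista_sparse_code(x, D, lam=0.1, n_iter=200):
    """Minimize 0.5 ||x - D z||_2^2 + lam |z|_1 by proximal gradient steps."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part's gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z - (D.T @ (D @ z - x)) / L, lam / L)
    return z

def energy(x, D, z, lam=0.1):
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0)             # unit-norm dictionary columns
x = rng.standard_normal(16)
z = ista_sparse_code(x, D)
```

The soft-thresholding step is what produces exact zeros in the code, mirroring the sparsity the L1 penalty asks for.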
<br />
This first problem can be addressed by applying sparse coding to the entire image and treating the dictionary as a convolutional filter bank. Invariance is classically achieved by regularization of the latent representation, e.g., by enforcing sparsity.<br />
<br />
:<math>L(x,z,D) = \frac{1}{2}||x - \sum_{k=1}^K D_k * z_k ||_2^2 + |z|_1</math><br />
<br />
Where D<sub>k</sub> is an s×s filter kernel, x is a w×h image, z<sub>k</sub> is a feature map of dimension (w+s-1)×(h+s-1), and * denotes the discrete convolution operator.<br />
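The convolutional energy above can be evaluated directly; the sketch below uses a naive "valid" convolution (kernel flipped, then slide-and-sum) and a delta-kernel sanity check, with all names being illustrative:

```python
import numpy as np

def conv2d_valid(z, D):
    """'Valid' 2-D convolution: flip the kernel, then slide-and-sum."""
    s = D.shape[0]
    Df = D[::-1, ::-1]
    H, W = z.shape[0] - s + 1, z.shape[1] - s + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(z[i:i + s, j:j + s] * Df)
    return out

def conv_sparse_energy(x, D, z, lam=1.0):
    """0.5 ||x - sum_k D_k * z_k||_2^2 + lam sum_k |z_k|_1."""
    recon = sum(conv2d_valid(z[k], D[k]) for k in range(D.shape[0]))
    return 0.5 * np.sum((x - recon) ** 2) + lam * np.sum(np.abs(z))

# sanity check: a delta kernel copies the code map, so x can be reconstructed exactly
s, w = 3, 8
x = np.arange(w * w, dtype=float).reshape(w, w)
D = np.zeros((1, s, s)); D[0, s - 1, s - 1] = 1.0    # flips to a delta at (0, 0)
z = np.zeros((1, w + s - 1, w + s - 1)); z[0, :w, :w] = x
```

Note the feature maps are larger than the image, exactly as the (w+s-1)×(h+s-1) sizing in the text requires, so that the "valid" convolution returns a w×h reconstruction.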
<br />
The second problem can be addressed by using a trainable feed-forward encoder to approximate the sparse code:<br />
<br />
:<math>L(x,z,D,W) = \frac{1}{2}||x - \sum_{k=1}^K D_k * z_k ||_2^2 + \sum_{k=1}^K||z_k - f(W^k*x)||_2^2 + |z|_1, \ \ \ z^* = \underset{z}{\operatorname{arg\ min}} \ L(x,z,D,W) </math><br />
<br />
Where W<sup>k</sup> is an encoding convolutional kernel of size s×s, and f is a point-wise non-linear function. Both the form of f and the method to find z* are discussed below. The authors use feed-forward neural networks to approximate the sparse codes generated by sparse coding, avoiding computationally costly optimizations at runtime.<br />
<br />
The contribution of this paper is to address these two issues simultaneously, thus allowing convolutional approaches to sparse coding.<br />
<br />
=Method=<br />
<br />
The authors extend the coordinate descent sparse coding algorithm detailed in <ref>Li, Y and Osher, S. Coordinate Descent Optimization for l1 Minimization with Application to Compressed Sensing; a Greedy Algorithm. CAM Report, pages 09–17.</ref> to use convolutional methods.<br />
<br />
Two considerations for learning convolution dictionaries are:<br />
#Boundary effects due to convolution must be handled.<br />
#Derivatives should be calculated efficiently.<br />
<br />
----<br />
'''function ConvCoD'''<math>\, (x,D,\alpha)</math><br />
<br />
:'''Set:''' <math>\, S = D^T*D</math><br />
<br />
:'''Initalize:''' <math>\, z = 0;\ \beta = D^T * mask(x)</math><br />
<br />
:'''Require:''' <math>\, h_\alpha</math>: smooth thresholding function<br />
<br />
:'''repeat'''<br />
<br />
::<math>\, \bar{z} = h_\alpha(\beta)</math><br />
<br />
::<math>\, (k,p,q) = \underset{i,m,n}{\operatorname{arg\ max}} |z_{imn}-\bar{z_{imn}}|</math> (k: dictionary index, (p,q) location index)<br />
<br />
::<math>\, bi = \beta_{kpq}</math><br />
<br />
::<math>\, \beta = \beta + (z_{kpq} - \bar{z}_{kpq}) \times align(S(:,k,:,:),(p,q))</math> **<br />
<br />
::<math>\, z_{kpq} = \bar{z}_{kpq},\ \beta_{kpq} = bi</math><br />
<br />
:'''until''' change in <math>z</math> is below a threshold<br />
<br />
:'''end function'''<br />
----<br />
<nowiki>**</nowiki> MATLAB notation is used for slicing the tensor.<br />
<br />
The second important point in training convolutional dictionaries is the computation of the <math>S = D^T * D</math> operator. For most algorithms, like coordinate descent, FISTA and matching pursuit, it is advantageous to store the similarity matrix <math>S</math> explicitly and use a single column at a time for updating the corresponding component of the code <math>z</math>. For convolutional modelling, the same approach can be followed with some additional care. In patch-based sparse coding, each element <math>(i, j)</math> of <math>S</math> equals the dot product of dictionary elements <math>i</math> and <math>j</math>. Since the similarity of a pair of dictionary elements also has to be considered in the spatial dimensions, each term is expanded as a "full" convolution of the two dictionary elements <math>(i, j)</math>, producing a <math>2s-1 \times 2s-1</math> matrix. It is more convenient to think of the result as a 4D tensor of size <math>K \times K \times 2s-1 \times 2s-1</math>. Note that, depending on the input image size, proper alignment of the corresponding column of this tensor has to be applied in the <math>z</math> space. One could also use the steepest descent algorithm for finding the solution to convolutional sparse coding, but this method would be orders of magnitude slower than specialized algorithms like CoD, and the solution would never contain exact zeros. The algorithm above extends the coordinate descent algorithm to convolutional inputs.<br />
In the above, <math>\beta = D^T * mask(x)</math> is used to handle boundary effects, where mask operates term by term and either zeros out or scales down the boundaries.<br />
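The 4D similarity tensor can be sketched as pairwise "full" cross-correlations of dictionary kernels (flip one kernel to obtain the convolution convention); sizes below are illustrative:

```python
import numpy as np

def full_xcorr2(a, b):
    """Full 2-D cross-correlation of two s x s kernels -> (2s-1) x (2s-1)."""
    s = a.shape[0]
    pa = np.zeros((3 * s - 2, 3 * s - 2))
    pa[s - 1:2 * s - 1, s - 1:2 * s - 1] = a   # zero-pad so every overlap is covered
    out = np.zeros((2 * s - 1, 2 * s - 1))
    for i in range(2 * s - 1):
        for j in range(2 * s - 1):
            out[i, j] = np.sum(pa[i:i + s, j:j + s] * b)
    return out

def similarity_tensor(D):
    """S as a K x K x (2s-1) x (2s-1) tensor of pairwise 'full' correlations."""
    K, s, _ = D.shape
    S = np.zeros((K, K, 2 * s - 1, 2 * s - 1))
    for i in range(K):
        for j in range(K):
            S[i, j] = full_xcorr2(D[i], D[j])
    return S

rng = np.random.default_rng(0)
D = rng.standard_normal((4, 3, 3))   # K = 4 dictionary kernels of size 3x3
S = similarity_tensor(D)
```

At zero spatial offset (the centre of each (2s-1)×(2s-1) slice), the diagonal entries reduce to the squared norms of the kernels, matching the patch-based dot-product interpretation.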
<br />
The learning procedure is then stochastic gradient descent over the dictionary D, where the columns of D are normalized after each iteration.<br />
<br />
:<math>\forall x^i \in X</math> training set: <math>z^* = \underset{z}{\operatorname{arg\ min}}\ L(x^i,z,D), \ \ D \leftarrow D - \eta \frac {\partial L(x^i,z^*,D)}{\partial D}</math><br />
<br />
Two encoder architectures are tested. The first is steepest-descent sparse coding with a tanh encoding function, <math>g^k \times \tanh(x*W^k)</math>, which does not include a shrinkage operator, so its ability to produce sparse representations is very limited.<br />
<br />
The second is convolutional CoD sparse coding with a smooth shrinkage operator as defined below. <br />
<br />
:<math>\tilde{z}=sh_{\beta^k,b^k}(x*W^k)</math> where k = 1..K.<br />
<br />
:<math>sh_{\beta^k,b^k}(s) = sign(s) \times 1/\beta^k \log(\exp(\beta^k \times |s|) - 1) - b^k</math><br />
<br />
where <math>\beta</math> controls the smoothness of the kink of the shrinkage operator and b controls the location of the kink. The second system is more efficient to train, but the performance of the two systems is almost identical.<br />
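A direct implementation of the smooth shrinkage operator as written above (note it is undefined at s = 0, where the logarithm diverges); the parameter values are illustrative:

```python
import numpy as np

def smooth_shrink(s, beta=10.0, b=1.0):
    """sh_{beta,b}(s) = sign(s) * (log(exp(beta |s|) - 1) / beta - b).
    For large beta this approaches soft shrinkage: |s| reduced by b."""
    return np.sign(s) * (np.log(np.expm1(beta * np.abs(s))) / beta - b)
```

For example, with beta = 10 and b = 1, an input of 5.0 maps to approximately 4.0, i.e. the magnitude is shifted down by b while the kink near zero stays smooth.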
<br />
The following figure shows the smooth shrinkage function, the total loss as a function of the number of iterations, and the 128 convolutional filters (W) learned in the encoder using the smooth shrinkage function.<br />
<br />
[[File:Q8.png]]<br />
<br />
The convolutional encoder can also be used in multi-stage object recognition architectures. For each stage, the encoder is followed by absolute value rectification, contrast normalization and average subsampling.<br />
<br />
=Experiments=<br />
<br />
Two systems are used:<br />
#Steepest descent sparse coding with tanh encoder: <math>SD^{tanh}</math><br />
#Coordinate descent sparse coding with shrink encoder: <math>CD^{shrink}</math><br />
<br />
==Object Recognition using Caltech-101 Dataset==<br />
<br />
In the Caltech-101 dataset, each image contains a single object. Each image is processed by converting to grayscale and resizing, followed by contrast normalization. All results use 30 training samples per class and 5 different choices of the training set.<br />
<br />
''Architecture:'' 64 features are extracted by the first layer, followed by a second layer that produces 256 features. Second layer features are connected to first layer features by a sparse connection table.<br />
<br />
''First Layer:'' Both systems are trained using 64 dictionary elements, where each dictionary item is a 9×9 convolution kernel. Both systems are trained for 10 sparsity values from 0.1-3.0.<br />
<br />
''Second Layer:'' In the second layer, each of 256 feature maps in connected to 16 randomly selected input features from the first layer.<br />
<br />
''One Stage System:'' In these results, the input is passed to the first layer, followed by absolute value rectification, contrast normalization, and average pooling. The output of the first layer is fed to a logistic classifier, and also to the PMK-SVM classifier used in <ref>Lazebnik, S, Schmid, C, and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR’06, 2:2169–2178, 2006.</ref>.<br />
<br />
''Two Stage System:'' These results use both layers, followed by absolute value rectification, contrast normalization, and average pooling. Finally, a multinomial logistic regression classifier is used.<br />
<br />
[[File:CoD_results.png]]<br />
<br />
In the above, U represents one stage, UU represents two stages, and '+' indicates that supervised training is performed afterwards.<br />
Each row of filters connects to a particular second-layer feature map. Each row of filters extracts similar features, since their output responses are summed together to form one output feature map.<br />
<br />
The following figure shows the second-stage filters. On the left are the encoder kernels that correspond to the dictionary elements. On the right are the 128 dictionary elements; each row shows 16 dictionary elements connecting to a single second-layer feature map. It can be seen that each group extracts similar types of features from its corresponding inputs.<br />
<br />
[[File:Q9.png]]<br />
<br />
==Pedestrian Detection==<br />
<br />
The architecture is trained and evaluated on the INRIA Pedestrian dataset <ref>Dalal, N and Triggs, B. Histograms of oriented gradients for human detection. In Schmid, C, Soatto, S, and Tomasi, C, editors, CVPR’05, volume 2, pages 886–893, June 2005.</ref> which contains 2416 positive examples (after mirroring) and 1218 negative full images. For training, the dataset is augmented with minor translations and scaling, giving a total of 11370 examples for training and 1000 images for classification. The negative examples are augmented with larger scale variations to avoid false positives, giving a total of 9001 samples for training and 1000 for validation.<br />
<br />
The architecture for the pedestrian detection task is similar to that described in the previous section. It was trained both with and without unsupervised initialization, followed by supervised training. After one pass of training, the negative set was augmented with the 10 most offending samples on each full negative image.<br />
<br />
[[File:CoD_pedestrian_results.png]]<br />
<br />
=Discussion=<br />
*The paper presented an efficient method for convolutional training of feature extractors.<br />
*The resulting features look intuitively better than those obtained through non-convolutional methods, but classification results are only slightly better (where they're better at all) than existing methods.<br />
*It's not clear what effects in the pedestrian experiment are due to the method of preprocessing and variations on the dataset (scaling and translation) and which are due to the architecture itself. Comparisons are with other systems that processed input differently.<br />
* Unsupervised learning significantly helps to properly model the extensive variations in the dataset, where a purely supervised learning algorithm fails.<br />
<br />
=References=<br />
<references /></div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=distributed_Representations_of_Words_and_Phrases_and_their_Compositionality&diff=27742distributed Representations of Words and Phrases and their Compositionality2017-08-30T13:46:32Z<p>Conversion script: Conversion script moved page Distributed Representations of Words and Phrases and their Compositionality to distributed Representations of Words and Phrases and their Compositionality: Converting page titles to lowercase</p>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Mikolov, Tomas, et al. [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf "Distributed representations of words and phrases and their compositionality."] Advances in neural information processing systems. 2013.<br />
</ref> presents several extensions of the Skip-gram model introduced by Mikolov et al. <ref name=MiT> Mikolov, Tomas, ''et al'' [http://arxiv.org/pdf/1301.3781v3.pdf"Efficient Estimation of Word Representations in Vector Space"] in ICLR Workshop, (2013). </ref>. The Skip-gram model is an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. The word representations computed using this model are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns. Somewhat surprisingly, many of these patterns can be represented as linear translations. For example, the result of a vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector. The authors of this paper show that subsampling of frequent words during training results in a significant speedup and improves accuracy of the representations of less frequent words. In addition, a simplified variant of Noise Contrastive Estimation (NCE) <ref name=GuM><br />
Gutmann, Michael U, ''et al'' [http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann12JMLR.pdf"Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics"] in The Journal ofMachine Learning Research, (2012).<br />
</ref>. for training the Skip-gram model is presented that results in faster training and better vector representations for frequent words, compared to more complex hierarchical softmax that was used in the prior work <ref name=MiT></ref>. It also shows that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations. For example, vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).<br />
<br />
= The Skip-gram Model =<br />
<br />
The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words <math>w_1, w_2,..., w_T</math> the objective of the Skip-gram model is to maximize the average log probability:<br />
<br />
<math><br />
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c\leq j\leq c} log(p(w_{t+j}|w_t))<br />
</math><br />
<br /><br />
<br /><br />
where <math>c</math> is the size of the training context (which can be a function of the center word <math>w_t</math>) and <math>p(w_{t+j}|w_t)</math> is defined using softmax function:<br />
<br />
<math><br />
p(w_O|w_I) = \frac{\exp ({v'_{w_O}}^T v_{w_I})}{\sum_{w=1}^{W} \exp ({v'_{w}}^T v_{w_I})}<br />
</math><br />
<br />
Here, <math>v_w</math> and <math>v'_w</math> are the “''input''” and “''output''” vector representations of <math>w</math>, and <math>W</math> is the number of words in the vocabulary.<br />
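To make the full-softmax definition above concrete, here is a minimal numpy sketch that computes <math>p(w_O|w_I)</math> from toy input and output vectors. All sizes and the random vectors are illustrative, not from the paper:

```python
import numpy as np

# Toy sketch of the full-softmax Skip-gram probability p(w_O | w_I).
W, d = 5, 4                      # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
v_in = rng.normal(size=(W, d))   # "input" vectors v_w
v_out = rng.normal(size=(W, d))  # "output" vectors v'_w

def skipgram_softmax(w_O, w_I):
    # p(w_O|w_I) = exp(v'_{w_O}^T v_{w_I}) / sum_w exp(v'_w^T v_{w_I})
    scores = v_out @ v_in[w_I]   # one score per vocabulary word
    scores -= scores.max()       # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[w_O]

p = np.array([skipgram_softmax(w, 2) for w in range(W)])
assert abs(p.sum() - 1.0) < 1e-9  # a valid distribution over the vocabulary
```

Note that the normalization sums over the entire vocabulary, which is exactly the cost that hierarchical softmax and negative sampling (below) are designed to avoid.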
<br />
== Hierarchical Softmax ==<br />
<br />
Hierarchical Softmax is a computationally efficient approximation of the full softmax <ref name=MoF><br />
Morin, Frederic, ''et al.'' [http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf "Hierarchical probabilistic neural network language model"] in Proceedings of the International Workshop on Artificial Intelligence and Statistics, (2005).<br />
</ref>. Instead of evaluating all <math>W</math> output nodes of the neural network to obtain the probability distribution, hierarchical softmax evaluates only about <math>log_2(W)</math> nodes.<br />
<br />
The hierarchical softmax uses a binary tree representation of the output layer with the <math>W</math> words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.<br />
<br />
Let <math>n(w,j)</math> be the <math>j^{th}</math> node on the path from the root to <math>w</math>, and let <math>L(w)</math> be the length of this path, so <math>n(w,1) = root</math> and <math>n(w,L(w)) = w</math>. In addition, for any inner node <math>n</math>, let <math>ch(n)</math> be an arbitrary fixed child of <math>n</math> and let <math>[[x]]</math> be 1 if <math>x</math> is true and -1 otherwise. Then the hierarchical softmax defines <math>p(w_O|w_I )</math> as follows:<br />
<br />
<math><br />
p(w|w_I) = \prod_{j=1}^{L(w)-1} \sigma \left( [[n(w,j+1)=ch(n(w,j))]] \, {v'_{n(w,j)}}^T v_{w_I} \right) <br />
</math><br />
<br />
where<br />
<br />
<math><br />
\sigma (x)=\frac{1}{1+exp(-x)}<br />
</math><br />
<br />
In this paper, a binary Huffman tree is used as the structure for the hierarchical softmax because it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models <ref name=MiT></ref><ref name=MiT2><br />
Mikolov, Tomas, ''et al.'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611 "Extensions of recurrent neural network language model"] in Acoustics, Speech and Signal Processing (ICASSP), (2011).<br />
</ref>.<br />
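The path-product definition above can be sketched in a few lines of numpy. The tree path, its inner-node vectors, and the left/right sign codes below are all made up for illustration; in practice they come from the Huffman tree:

```python
import numpy as np

# Toy sketch of the hierarchical-softmax probability of one word:
# p(w|w_I) = prod_j sigma([[n(w,j+1)=ch(n(w,j))]] * v'_{n(w,j)}^T v_{w_I})
d = 4
rng = np.random.default_rng(1)
v_in = rng.normal(size=d)        # v_{w_I}, vector of the input word
inner = rng.normal(size=(3, d))  # v'_n for the 3 inner nodes on the path to w
signs = np.array([1, -1, 1])     # +1 where the path goes to ch(n), else -1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

p_word = np.prod(sigmoid(signs * (inner @ v_in)))
assert 0.0 < p_word < 1.0
```

Because <math>\sigma(x) + \sigma(-x) = 1</math>, the two children of every inner node split that node's probability mass, so the leaf probabilities over the whole vocabulary automatically sum to 1 without an explicit normalization over <math>W</math> words.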
<br />
<br />
<br />
== Negative Sampling==<br />
<br />
Noise Contrastive Estimation (NCE) is an alternative to the hierarchical softmax. NCE posits that a good model should be able to differentiate data from noise by means of logistic regression. While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. Negative sampling (NEG) is defined by the objective:<br />
<br />
<math><br />
\log \sigma ({v'_{w_O}}^T v_{w_I})+\sum_{i=1}^{k} \mathbb{E}_{w_i\sim P_n(w)}[\log \sigma (-{v'_{w_i}}^T v_{w_I})]<br />
</math><br />
<br />
The main difference between negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for this application.<br /><br />
Both NCE and NEG have the noise distribution <math>P_n(w)</math> as a free parameter. We investigated a number of choices for <math>P_n(w)</math> and found that the unigram distribution <math>U(w)</math> raised to the 3/4 power (i.e., <math>U(w)^{3/4}/Z</math>) significantly outperformed both the unigram and the uniform distributions, for both NCE and NEG, on every task we tried including language modeling.<br />
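The NEG objective for a single (input, output) pair can be sketched as follows, including the <math>U(w)^{3/4}/Z</math> noise distribution. The toy unigram counts, vectors, and the value of ''k'' are illustrative:

```python
import numpy as np

# Sketch of the negative-sampling (NEG) objective for one (w_I, w_O) pair.
rng = np.random.default_rng(2)
counts = np.array([100, 50, 10, 5, 1], dtype=float)  # toy unigram counts
P_n = counts ** 0.75
P_n /= P_n.sum()                                     # U(w)^{3/4} / Z

W, d = len(counts), 4
v_in = rng.normal(size=(W, d))
v_out = rng.normal(size=(W, d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_objective(w_I, w_O, k=5):
    pos = np.log(sigmoid(v_out[w_O] @ v_in[w_I]))    # push the true pair together
    noise = rng.choice(W, size=k, p=P_n)             # k negative samples from P_n
    neg = np.log(sigmoid(-(v_out[noise] @ v_in[w_I]))).sum()  # push noise apart
    return pos + neg                                 # maximized during training

obj = neg_objective(0, 1)
assert obj <= 0.0  # each log-sigmoid term is non-positive
```

Only k + 1 dot products are evaluated per training pair, versus <math>W</math> for the full softmax, which is where the speedup comes from.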
<br />
==Subsampling of Frequent Words==<br />
In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information about the surrounding words than rarer words do (e.g., "the" provides little information about the next word because it co-occurs with a huge number of words), and the representations of frequent words are unlikely to change significantly after many iterations. <br />
<br />
To counter the imbalance between the rare and frequent words, a simple subsampling approach is used. Each word <math>w_i</math> in the training set is discarded with probability computed by the formula:<br />
<br />
<math><br />
P(w_i)=1-\sqrt{\frac{t}{f(w_i)}}<br />
</math><br />
<br />
where <math>f(w_i)</math> is the frequency of word <math>w_i</math> and <math>t</math> is a chosen threshold, typically around <math>10^{−5}</math>. This subsampling formula was chosen because it aggressively subsamples words whose frequency is greater than ''t'' while preserving the ranking of the frequencies. Although the formula was chosen heuristically, it works well in practice: it accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words.<br />
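The discard rule is a one-liner; a short sketch with illustrative frequencies shows its behavior at both ends of the frequency spectrum:

```python
# Sketch of the subsampling rule: discard word w_i with probability
# P(w_i) = 1 - sqrt(t / f(w_i)). Frequencies here are illustrative.
t = 1e-5
freqs = {"the": 5e-2, "model": 1e-4, "volga": 2e-6}

def discard_prob(f, t=t):
    # clamp at 0: words rarer than the threshold t are never discarded
    return max(0.0, 1.0 - (t / f) ** 0.5)

assert discard_prob(freqs["the"]) > 0.98    # very frequent: almost always dropped
assert discard_prob(freqs["volga"]) == 0.0  # rarer than t: always kept
```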
<br />
= Empirical Results=<br />
<br />
The Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative Sampling, and subsampling of the training words are evaluated with the help of the analogical reasoning task <ref name=MiT></ref>. The task consists of analogies such as “Germany” : “Berlin” :: “France” : ?, which are solved by finding a word ''x'' such that vec(''x'') is closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”) according to the cosine distance. This specific example is considered to have been answered correctly if ''x'' is “Paris”. The task has two broad categories: the syntactic analogies (such as “quick” : “quickly” :: “slow” : “slowly”) and the semantic analogies, such as the country to capital city relationship.<br />
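The analogy-solving procedure described above can be sketched directly. The tiny hand-built vectors below are purely illustrative stand-ins for learned embeddings:

```python
import numpy as np

# Sketch of solving "a : b :: c : ?" as the nearest cosine neighbor of
# vec(b) - vec(a) + vec(c), excluding the three query words themselves.
vecs = {
    "germany": np.array([1.0, 0.0, 0.2]),
    "berlin":  np.array([1.0, 1.0, 0.2]),
    "france":  np.array([0.0, 0.0, 1.0]),
    "paris":   np.array([0.0, 1.0, 1.0]),
    "madrid":  np.array([0.5, 1.0, 0.0]),
}

def analogy(a, b, c):
    target = vecs[b] - vecs[a] + vecs[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(vecs[w], target))

assert analogy("germany", "berlin", "france") == "paris"
```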
<br />
For training the Skip-gram models, a large dataset consisting of various news articles is used (an internal Google dataset with one billion words). All words that occurred less than 5 times in the training data were discarded, which resulted in a vocabulary of size 692K. The performance of various Skip-gram models on the word analogy test set is reported in Table 1. The table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation. The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.<br />
<br />
<center><br />
[[File:Tb_1.PNG | frame | center |Table 1. Accuracy of various Skip-gram 300-dimensional models on the analogical reasoning task as defined in <ref name=MiT></ref>. NEG-''k'' stands for Negative Sampling with ''k'' negative samples for each positive sample; NCE stands for Noise Contrastive Estimation and HS-Huffman stands for the Hierarchical Softmax with the frequency-based Huffman codes. ]]<br />
</center><br />
<br />
=Learning Phrases=<br />
<br />
Many phrases have a meaning that is not a simple composition of the meanings of their individual words. To learn vector representations for phrases, we first find words that appear frequently together, and infrequently in other contexts. For example, “''New York Times''” and “''Toronto Maple Leafs''” are replaced by unique tokens in the training data, while a bigram “''this is''” will remain unchanged. This way, we can form many reasonable phrases without greatly increasing the size of the vocabulary; in theory, we could train the Skip-gram model on all n-grams, but that would be too memory intensive. Instead, a simple data-driven approach based on the unigram and bigram counts is applied to identify phrases. In this approach, a ''score'' is calculated as:<br />
<br />
<math><br />
score(w_i,w_j)=\frac{count(w_iw_j)-\delta}{count(w_i)count(w_j)}<br />
</math><br />
<br />
Here, <math>\delta</math> is used as a discounting coefficient and prevents too many phrases consisting of very infrequent words from being formed. The bigrams with ''scores'' above the chosen threshold are then used as phrases. The quality of the phrase representations is evaluated using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task.<br />
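The scoring formula can be sketched over a toy corpus; the corpus text, the value of <math>\delta</math>, and the threshold choice are all illustrative:

```python
from collections import Counter

# Sketch of the data-driven phrase scorer:
# score(w_i, w_j) = (count(w_i w_j) - delta) / (count(w_i) * count(w_j))
corpus = "new york times reported that new york is large".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def score(wi, wj, delta=1.0):
    return (bigrams[(wi, wj)] - delta) / (unigrams[wi] * unigrams[wj])

# "new york" co-occurs consistently, so it scores higher than the
# incidental bigram "that new"; bigrams above a threshold become tokens.
assert score("new", "york") > score("that", "new")
```

In the paper, several passes over the corpus with decreasing thresholds are used so that phrases longer than two words can be built up from already-merged tokens.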
<br />
<center><br />
[[File:Tb_2.PNG | frame | center |Table 2. Examples of the analogical reasoning task for phrases (the full test set has 3218 examples). The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.]]<br />
</center><br />
<br />
==Phrase Skip-Gram Results==<br />
<br />
First, the phrase-based training corpus is constructed and then Skip-gram models are trained using different hyperparameters. Table 3 shows the results using vector dimensionality 300 and context size 5. This setting already achieves good performance on the phrase dataset, and allowed us to quickly compare the Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens. The results show that while Negative Sampling achieves a respectable accuracy even with ''k = 5'', using ''k = 15'' achieves considerably better performance. Subsampling can result in faster training and can also improve accuracy, at least in some cases.<br />
<br />
<center><br />
[[File:Tb_3.PNG | frame | center |Table 3. Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.]]<br />
</center><br />
<br />
<br />
The amount of training data was increased to 33 billion words in order to maximize the accuracy on the phrase analogy task. Hierarchical softmax, a dimensionality of 1000, and the entire sentence as the context were used. This resulted in a model that reached an accuracy of 72%. Reducing the size of the training dataset to 6 billion words lowered the accuracy to 66%, which suggests that a large amount of training data is crucial. To gain further insight into how different the representations learned by different models are, nearest neighbors of infrequent phrases were inspected manually using various models. Table 4 shows a sample of such a comparison. Consistent with the previous results, it seems that the best representations of phrases are learned by a model with the hierarchical softmax and subsampling.<br />
<br />
<center><br />
[[File:Tb_4.PNG | frame | center |Table 4. Examples of the closest entities to the given short phrases, using two different models.]]<br />
</center><br />
<br />
=Additive Compositionality=<br />
<br />
The word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetic. The Skip-gram representations also exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations. This phenomenon is illustrated in Table 5. The additive property of the vectors can be explained by inspecting the training objective. The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as an AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability.<br />
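The element-wise addition described above can be sketched as a nearest-neighbor lookup on the summed vector. The tiny hand-built vectors are illustrative, not learned:

```python
import numpy as np

# Sketch of element-wise addition as an approximate AND over contexts:
# the sum of two word vectors is closest to a token related to both.
vecs = {
    "russia":      np.array([1.0, 0.0, 0.1]),
    "river":       np.array([0.0, 1.0, 0.1]),
    "volga_river": np.array([0.9, 0.9, 0.2]),
    "france":      np.array([0.1, 0.0, 1.0]),
}

def closest_to_sum(a, b):
    target = vecs[a] + vecs[b]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    # exclude the two summed words themselves from the candidates
    return max((w for w in vecs if w not in (a, b)),
               key=lambda w: cos(vecs[w], target))

assert closest_to_sum("russia", "river") == "volga_river"
```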
<br />
<center><br />
[[File:Tb_5.PNG | frame | center |Table 5. Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model.]]<br />
</center><br />
<br />
=Comparison to Published Word Representations=<br />
<br />
Table 6 shows the empirical comparison between different neural network-based representations of words by showing the nearest neighbors of infrequent words. These examples show that the big Skip-gram model trained on a large corpus visibly outperforms all the other models in the quality of the learned representations. This can be attributed in part to the fact that this model has been trained on about 30 billion words, which is about two to three orders of magnitude more data than the typical size used in the prior work. Interestingly, although the training set is much larger, the training time of the Skip-gram model is just a fraction of the time required by the previous model architectures.<br />
<br />
<center><br />
[[File:Tb_6.PNG | frame | center |Table 6. Examples of the closest tokens given various well-known models and the Skip-gram model trained on phrases using over 30 billion training words. An empty cell means that the word was not in the vocabulary.]]<br />
</center><br />
<br />
=Conclusion=<br />
<br />
This work has the following key contributions:<br />
<br />
1. It shows how to train distributed representations of words and phrases with the Skip-gram model and demonstrates that these representations exhibit a linear structure that makes precise analogical reasoning possible.<br />
<br />
2. The model architecture is computationally efficient, which makes it possible to train models on several orders of magnitude more data than the previously published models.<br />
<br />
3. It introduces the negative sampling algorithm, an extremely simple training method that learns accurate representations, especially for frequent words.<br />
<br />
4. The choice of training algorithm and hyper-parameters is a task-specific decision. It is shown that the most crucial decisions affecting performance are the choice of model architecture, the size of the vectors, the subsampling rate, and the size of the training window.<br />
<br />
5. The word vectors can be meaningfully combined using just simple vector addition. Another approach to learning representations of phrases presented in this paper is to simply represent each phrase with a single token. Combining these two approaches gives a powerful yet simple way to represent longer pieces of text while incurring minimal computational complexity.<br />
<br />
Le et al.<ref><br />
Le Q, Mikolov T. [http://arxiv.org/pdf/1405.4053v2.pdf "Distributed Representations of Sentences and Documents"]. Proceedings of the 31st International Conference on Machine Learning, 2014 </ref> used the ideas of this paper for learning paragraph vectors. In that later work, paragraph vectors are used to predict the next word: every word and every paragraph is mapped to a unique vector, stored as a column of the matrices W and D respectively, and the paragraph vector is concatenated with the word vectors to predict the next word. <br />
<br />
= Recursive Autoencoder =<br />
<br />
This section is based on the paper "Semi-supervised recursive autoencoders for predicting sentiment distributions".<ref> Socher, ''et al.'' [http://www.socher.org/uploads/Main/SocherPenningtonHuangNgManning_EMNLP2011.pdf] </ref><br />
=== Other techniques for sentence representation ===<br />
<br />
The idea of the recursive autoencoder is summarized in the figure below, which illustrates a recursive autoencoder applied to a binary tree.<br />
<center><br />
[[File:Recur-auto.png]]<br />
</center><br />
<br />
Given a list of word vectors <math> x = (x_1, ..., x_m)</math>, the binary tree defines parent-child triplets: <math> (y_1 \rightarrow x_3x_4), (y_2 \rightarrow x_2y_1), (y_3 \rightarrow x_1y_2) </math>.<br />
<br />
The first parent <math> y_1 </math> is computed from the children <math> (c_1, c_2) = (x_3, x_4)</math> as <math> p=f(W^{(1)}[c_1; c_2] + b^{(1)})</math>, where <math>W^{(1)}</math> is a parameter matrix and <math>b^{(1)}</math> is a bias term. <br />
<br />
The autoencoder comes in by reconstructing the children as <math> [c_1'; c_2'] = W^{(2)}p + b^{(2)}</math>. The objective of this method is to minimize the mean squared error between the original children and the reconstructed children.<br />
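One encode-decode step of the recursive autoencoder can be sketched as follows. The dimensionality, the random weights, and the random children are illustrative; in training, the weights are shared across all nodes of the tree and learned by minimizing the reconstruction error:

```python
import numpy as np

# Sketch of one recursive-autoencoder step: encode two children into a
# parent, decode them back, and measure the reconstruction error.
d = 3
rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(d, 2 * d)), np.zeros(d)      # encoder parameters
W2, b2 = rng.normal(size=(2 * d, d)), np.zeros(2 * d)  # decoder parameters

c1, c2 = rng.normal(size=d), rng.normal(size=d)
children = np.concatenate([c1, c2])                    # [c1; c2]

parent = np.tanh(W1 @ children + b1)    # p = f(W1 [c1; c2] + b1)
recon = W2 @ parent + b2                # [c1'; c2'] = W2 p + b2
mse = np.mean((children - recon) ** 2)  # training minimizes this error
assert mse >= 0.0
```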
<br />
=Resources=<br />
<br />
The code for training the word and phrase vectors based on this paper is available in the open source project [https://code.google.com/p/word2vec/ word2vec]. This project also contains a set of pre-trained 300-dimensional vectors for 3 million words and phrases.<br />
<br />
=References=<br />
<references /></div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=generating_text_with_recurrent_neural_networks&diff=27730generating text with recurrent neural networks2017-08-30T13:46:31Z<p>Conversion script: Conversion script moved page Generating text with recurrent neural networks to generating text with recurrent neural networks: Converting page titles to lowercase</p>
<hr />
<div>= Introduction =<br />
<br />
The goal of this paper is to introduce a new type of recurrent neural network for character-level language modelling that allows the input character at a given timestep to multiplicatively gate the connections that make up the hidden-to-hidden layer weight matrix. The paper also introduces a solution to the problem of vanishing and exploding gradients by applying a technique called Hessian-Free optimization to effectively train a recurrent network that, when unrolled in time, has approximately 500 layers. At the date of publication, this network was arguably the deepest neural network ever trained successfully. <br />
<br />
Strictly speaking, a language model is a probability distribution over sequences of words or characters, and such models are typically used to predict the next character or word in a sequence given some number of preceding characters or words. Recurrent neural networks are naturally applicable to this task, since they make predictions based on a current input and a hidden state whose value is determined by some number of previous inputs. Alternative methods that the authors compare their results to include a hierarchical Bayesian model called a 'sequence memoizer' <ref> Wood, F., C. Archambeau, J. Gasthaus, L. James, and Y.W. The. [http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/airg/readings/2012_02_28_a_stochastic_memoizer.pdf "A Stochastic Memoizer for Sequence Data"] ICML, (2009) </ref> and a mixture of context models referred to as PAQ <ref> Mahoney, M. [https://repository.lib.fit.edu/bitstream/handle/11141/154/cs-2005-16.pdf?sequence=1&isAllowed=y "Adaptive Weighing of Context Models for Lossless Data Compression"], Florida Institute of Technology Technical Report, (2005) </ref>, which actually includes word-level information (rather than strictly character-level information). The multiplicative RNN introduced in this paper improves on the state-of-the-art for solely character-level language modelling, but is somewhat worse than the state-of-the-art for text compression. <br />
<br />
To give a brief review, an ordinary recurrent neural network is parameterized by three weight matrices, <math>\ W_{hi} </math>, <math>\ W_{hh} </math>, and <math>\ W_{oh} </math>, and functions to map a sequence of <math> N </math> input states <math>\ [i_1, ... , i_N] </math> to a sequence of hidden states <math>\ [h_1, ... , h_N] </math> and a sequence of output states <math>\ [o_1, ... , o_N] </math>. The matrix <math>\ W_{hi} </math> parameterizes the mapping from the current input state to the current hidden state, while the matrix <math>\ W_{hh} </math> parameterizes the mapping from the previous hidden state to current hidden state, such that the current hidden state is function of the previous hidden state and the current input state. Finally, the matrix <math>\ W_{oh} </math> parameterizes the mapping from the current hidden state to the current output state. So, at a given timestep <math>\ t </math>, the values of the hidden state and output state are as follows:<br />
<br />
<br />
:<math>\ h_t = tanh(W_{hi}i_t + W_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
:<math>\ o_t = W_{oh}h_t + b_o </math> <br />
<br />
<br />
where <math>\ b_o</math> and <math>\ b_h</math> are bias vectors. Typically, the output state is converted into a probability distribution over characters or words using the softmax function. The network can then be treated as a generative model of text by sampling from this distribution and providing the sampled output as the input to the network at the next timestep.<br />
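The update equations above, together with the softmax over characters, can be sketched in a few lines of numpy. The sizes and random weights are illustrative:

```python
import numpy as np

# Sketch of one step of the character-level RNN described above.
V, H = 6, 8                     # vocabulary size, hidden size
rng = np.random.default_rng(4)
W_hi = rng.normal(size=(H, V)) * 0.1  # input-to-hidden weights
W_hh = rng.normal(size=(H, H)) * 0.1  # hidden-to-hidden weights
W_oh = rng.normal(size=(V, H)) * 0.1  # hidden-to-output weights
b_h, b_o = np.zeros(H), np.zeros(V)

def step(i_t, h_prev):
    h_t = np.tanh(W_hi @ i_t + W_hh @ h_prev + b_h)
    o_t = W_oh @ h_t + b_o
    p_t = np.exp(o_t - o_t.max())
    p_t /= p_t.sum()            # softmax distribution over the next character
    return h_t, p_t

x = np.zeros(V); x[2] = 1.0     # one-hot input character
h, p = step(x, np.zeros(H))
assert abs(p.sum() - 1.0) < 1e-9
```

Running the network generatively amounts to sampling a character from `p`, encoding it as the next one-hot input, and repeating.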
<br />
Recurrent networks are known to be very difficult to train due to the existence of a highly unstable relationship between a network's parameters and the gradient of its cost function. Intuitively, the surface of the cost function is intermittently punctuated by abrupt changes (giving rise to exploding gradients) and nearly flat plateaus (giving rise to vanishing gradients) that can effectively become poor local minima when a network is trained through gradient descent. Techniques for improving training include the use of Long Short-Term Memory networks <ref> Hochreiter, Sepp, and Jürgen Schmidhuber. [http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf "Long short-term memory."] Neural computation 9.8 (1997): 1735-1780. </ref>, in which memory units are used to selectively preserve information from previous states, and the use of Echo State networks <ref> Jaeger, H. and H. Haas. [http://www.sciencemag.org/content/304/5667/78.short "Harnassing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication."] Science, 204.5667 (2004): 78-80. </ref>, which learn only the output weights on a network with recurrent connections that implement a wide range of time-varying patterns. In this paper, the method of Hessian-free optimization is used instead of these alternatives. <br />
<br />
[[File:RNN.png | frame | centre | A depiction of a recurrent neural network unrolled through three time steps.]]<br />
<br />
= Hessian-Free Optimization = <br />
<br />
While this optimization technique is described elsewhere in Martens (2010) <ref> Martens, J. [http://icml2010.haifa.il.ibm.com/papers/458.pdf "Deep learning via Hessian-free optimization."] ICML, (2010) </ref><br />
, its use is essential to obtaining the successful results reported in this paper. In brief, the technique uses information about the second derivatives of the cost function to perform more intelligent parameter updates. This information is helpful because in cases where the gradient is changing very slowly along a particular dimension, it is more efficient to take larger steps in the direction of descent along that dimension. Alternatively, if the gradient is changing very rapidly along a particular dimension, then it makes sense to take smaller steps to avoid 'bouncing' off of a steep incline in the cost function and moving to a less desirable location in parameter space. The relevant second-order information is computed using the method of finite differences to avoid explicitly forming the Hessian of the cost function. In fact, instead of computing and inverting the Hessian in the update equations, the Gauss-Newton approximation to the Hessian is used, which is a good approximation and is much cheaper to compute in practice. <br />
<br />
What is important about this technique is that it provides a solution to problem of vanishing and exploding gradients during the training of recurrent neural networks. Vanishing gradients are accommodated by descending much more rapidly along the cost function in areas where it has relatively low curvature (e.g., when the cost function is nearly flat), while exploding gradients are accommodated by descending much more slowly along the cost function in areas where it has relatively high curvature (e.g., when there is a steep cliff). The figure below illustrates how hessian free optimization improves the training of neural networks in general. <br />
<br />
[[File:HFF.png | frame | centre | On the left is training with naive gradient descent, and on the right is training via the use of 2nd order information about the cost function.]]<br />
<br />
= Multiplicative Recurrent Neural Networks = <br />
<br />
The authors report that using a standard neural network trained via Hessian-free optimization produces only mediocre results. As such, they introduce a new architecture called a multiplicative recurrent neural network (MRNN). The motivating intuition behind this architecture is that the input at a given time step should both additively contribute to the hidden state (through the mapping performed by the input-to-hidden weights) and additionally determine the weights on the recurrent connections to the hidden state. This approach came from viewing an RNN as a model of a tree in which each node is a hidden state vector and each edge is labelled by a character that determines how the parent node gives rise to the child node. In other words, the idea is to define a unique weight matrix <math>\ W_{hh} </math> for each possible input. The reason this design is hypothesized to improve the predictive adequacy of the model is the idea that the ''conjunction'' of the input at one time step and the hidden state at the previous time step is important. Capturing this conjunction requires the input to influence the contribution of the previous hidden state to the current hidden state. Otherwise, the previous hidden state and the current input will make entirely independent contributions to the calculation of the current hidden state. Formally, this changes the calculation of the hidden state at a given time step as follows:<br />
<br />
<br />
:<math>\ h_t = tanh(W_{hi}i_t + W^{i_t}_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
where <math>\ W^{i_t}_{hh} </math> is an input-specific hidden-to-hidden weight matrix. As a first approach to implementing this MRNN, the authors suggest using a tensor of rank 3 to store the hidden-to-hidden weights. The idea is that the tensor stores one weight matrix per possible input; when the input is provided as a one-hot vector, tensor contraction (i.e. a generalization of matrix multiplication) can be used to extract the 'slice' of the tensor that contains the appropriate set of weights. One problem with this approach is that it quickly becomes impractical to store the hidden-to-hidden weights as a tensor if the hidden state has a large number of dimensions. For instance, if a network's hidden layer encodes a vector with 1000 dimensions, then the number of parameters in the tensor that need to be learned will be equal to <math>\ 1000^2 * N </math>, where <math>\ N </math> is the vocabulary size. In short, this method will add many millions of parameters to a model for a non-trivially sized vocabulary. <br />
<br />
To fix this problem, the tensor is factored using a technique described in Taylor & Hinton (2009) <ref>Taylor, G. and G. Hinton. [http://www.cs.toronto.edu/~fritz/absps/fcrbm_icml.pdf "Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style"] ICML (2009) </ref>. The idea is to define three matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math> that approximate the use of a tensor in determining the value of <math>\ W^{i_t}_{hh} </math> as follows:<br />
<br />
<br />
:<math>\ W^{i_t}_{hh} = W_{hf} \cdot diag(W_{fi}i_t) \cdot W_{fh} </math><br />
<br />
<br />
Intuitively, this factorization produces two vectors from the current input state and the previous hidden state, takes their element-wise product, and applies a linear transformation to produce the input to the hidden layer at the current timestep. The triangle units in the figure below indicate where the element-wise product occurs, and the connections into and out of these units are parameterized by the matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math>. The element-wise multiplication is implemented by diagonalizing the matrix-vector product <math>\ W_{fi}i_t </math>, and if the dimensionality of this matrix-vector product (i.e. the dimensionality of the layer of multiplicative units) is allowed to be arbitrarily large, then this factorization is just as expressive as using a tensor to store the hidden-to-hidden weights. <br />
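The factored update can be sketched without ever materializing a per-character weight matrix; the sketch below also verifies that it matches the explicit <math>\ W_{hf} \cdot diag(W_{fi}i_t) \cdot W_{fh} </math> construction. Sizes and random weights are illustrative:

```python
import numpy as np

# Sketch of the factored multiplicative hidden-state update.
V, H, F = 6, 8, 8               # vocabulary, hidden units, factors
rng = np.random.default_rng(5)
W_hi = rng.normal(size=(H, V)) * 0.1
W_fi = rng.normal(size=(F, V)) * 0.1
W_fh = rng.normal(size=(F, H)) * 0.1
W_hf = rng.normal(size=(H, F)) * 0.1
b_h = np.zeros(H)

def mrnn_step(i_t, h_prev):
    gates = W_fi @ i_t          # one multiplicative gate per factor
    # element-wise product of gates with the projected previous hidden state
    return np.tanh(W_hi @ i_t + W_hf @ (gates * (W_fh @ h_prev)) + b_h)

x = np.zeros(V); x[1] = 1.0     # one-hot input character
h_prev = rng.normal(size=H)
h = mrnn_step(x, h_prev)

# Identical to explicitly building the input-specific matrix W_hh^{i_t}:
W_hh_x = W_hf @ np.diag(W_fi @ x) @ W_fh
h_ref = np.tanh(W_hi @ x + W_hh_x @ h_prev + b_h)
assert np.allclose(h, h_ref)
```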
<br />
[[File:MRNN.png | frame | centre | A depiction of a multiplicative recurrent neural network unrolled through three time steps.]]<br />
<br />
In the experiments described below, an MRNN is trained via Hessian-free optimization on sequences of 250 characters. The first 50 characters were used to condition the hidden state, so only 200 predictions are generated per sequence. 1500 hidden units were used, along with 1500 factors (i.e. multiplicative gates, or the triangles in the figure above), yielding an unrolled network of 500 layers if the multiplicative units are treated as forming a layer. Training was performed with a parallelized system consisting of 8 GPUs. A vocabulary of 86 characters was used in all cases.<br />
<br />
= The RNN as a Generative Model =<br />
The goal of the model is to predict the next character given a string of characters. More formally, given a training sequence <math>(x_1,...,x_T)</math>, the RNN uses its output vectors <math>(o_1,...,o_T)</math> to obtain a sequence of predictive distributions <math>P(x_{t+1}|x_{\le t}) = softmax(o_t)</math>.<br />
<br />
= Quantitative Experiments =<br />
<br />
To compare the performance of the MRNN to that of the sequence memoizer and PAQ, three 100MB datasets were used: a selection of wikipedia articles, a selection of New York Times articles, and a corpus of all available articles published in NIPS and JMLR. The last 10 million characters in each dataset were held out for testing. Additionally, the MRNN was trained on the larger corpora from which the wikipedia text and NYT articles were drawn (i.e. all of wikipedia, and the entire set of NYT articles). <br />
<br />
The models were evaluated by calculating the number of bits per character achieved by each model on the 3 test sets. This metric is essentially a measure of model perplexity, reflecting how well a given model predicts the data it is being tested on. If the number of bits per character is high, the model is, on average, highly uncertain about the value of each character in the test set; if it is low, the model is less uncertain. One way to think about this quantity is as the average amount of additional information (in bits) needed by the model to exactly identify the value of each character in the test set. So, a lower measure is better, indicating that the model achieves a good representation of the underlying data. (It is sometimes helpful to think of a language model as a compressed representation of a text corpus.) <br />
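The bits-per-character metric can be computed directly from the probabilities a model assigns to the characters that actually occurred; this helper is an illustration, not the paper's evaluation code:<br />

```python
import numpy as np

def bits_per_character(probs):
    """Average negative log2-probability assigned to each observed test character.
    `probs` holds the model's probability for the character that actually occurred."""
    probs = np.asarray(probs, dtype=float)
    return float(-np.log2(probs).mean())

# A uniform model over 86 characters needs log2(86) ≈ 6.43 bits per character;
# anything lower means the model has learned structure in the text.
print(bits_per_character([1 / 86] * 100))
print(bits_per_character([0.5] * 100))    # 1.0
```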
<br />
As illustrated in the table below, the MRNN achieves a lower number of bits per character than the hierarchical Bayesian model (the sequence memoizer), but a higher number than the PAQ model (which, recall, is not a strictly character-level model). The numbers in brackets indicate the bits per character achieved on the training data, and the column labelled 'Full Set' reports the results of training the MRNN on the full Wikipedia and NYT corpora. <br />
<br />
[[File:bits.png | frame | centre | Bits per character achieved by each model on each dataset.]]<br />
<br />
These results indicate that the MRNN beat the existing state-of-the-art for pure character-level language modelling at the time of publication. <br />
<br />
= Qualitative Experiments =<br />
<br />
By examining the output of the MRNN, it is possible to see what kinds of linguistic patterns it is able to learn. Most striking is the fact that the model consistently produces correct words from a fairly sophisticated vocabulary. The model is also able to balance parentheses and quotation marks over many time steps, and it occasionally produces plausible non-words such as 'cryptoliation' and 'homosomalist'. The text in the figure below was produced by running the model in generative mode less than 10 times using the phrase 'The meaning of life is' as an initial input, and then selecting the most interesting output sequence. The model was trained on wikipedia to produce the results in the figure below. The character '?' indicates an unknown item, and some of the spacing and punctuation oddities are due to preprocessing and are apparently common in the dataset. <br />
<br />
[[File:text.png | frame | centre | A selection of text generated by an MRNN initialized with the sequence "The meaning of life is...".]]<br />
<br />
Another interesting qualitative demonstration of the model's abilities involves initializing the model with a more complicated sequence and seeing what sort of continuations it produces. In the figure below, a number of sampled continuations of the phrase 'England, Spain, France, Germany' are shown. Generally, the model is able to provide continuations that preserve the list-like structure of the phrase. Moreover, the model is also able to recognize that the list is a list of locations, and typically offers additional locations as its predicted continuation of the sequence. <br />
<br />
[[File:locations.png | frame | centre | Selections of text generated by an MRNN initialized with the sequence "England, Spain, France, Germany".]]<br />
<br />
What is particularly impressive about these results is the fact that the model is learning a distribution over sequences of characters only. From this distribution, a broad range of syntactic and lexical knowledge emerges. It is also worth noting that it is much more efficient to train a model with a small character-level vocabulary than it is to train a model with a word-level vocabulary (which can have tens of thousands of items). As such, the character-level MRNN is able to scale to large datasets quite well.<br />
<br />
Moreover, the authors find that the MRNN remains sensitive to notation such as an opening bracket even when that exact string does not occur in the training set. They claim that any method based on precise context matches is fundamentally incapable of utilizing long contexts, because the probability that a long context occurs more than once is very small.<br />
<br />
= Discussion =<br />
<br />
One aspect of this work worth considering is the degree to which input-dependent gating of the information passed from hidden state to hidden state actually improves the results over and above a standard recurrent neural network. Presumably, Hessian-free optimization would allow one to successfully train such a standard network as well, so it would be helpful to see a direct comparison against the MRNN. MRNNs already learn surprisingly good language models<br />
using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions. Without such a comparison, however, it is hard to discern the relative importance of the optimization technique and the network architecture in achieving the good language modelling results reported in this paper.<br />
The MRNN assigns probability to plausible words that do not exist in the training set. This is a useful property that enables the MRNN to deal with real words it did not see during training. Another advantage of this model is that it avoids a huge softmax over all known words by predicting the next word through a sequence of character predictions, whereas some word-level language models make up binary spellings of words so that they can be predicted one bit at a time.<br />
<br />
= Bibliography = <br />
<references /></div>

learning Long-Range Vision for Autonomous Off-Road Driving
<hr />
<div>= Introduction =<br />
<br />
Stereo vision has been used extensively by mobile robots to identify near-to-far obstacles in their path, but it is limited by its maximum range of about 12 meters. For the safety of high-speed mobile robots, recognizing obstacles at longer ranges is vital.<br />
<br />
The authors of this paper proposed a "long-range vision system that uses self-supervised learning to train a classifier in real-time" <ref name="hadsell2009">Hadsell, Raia, et al. "Learning long‐range vision for autonomous off‐road driving." Journal of Field Robotics 26.2 (2009): 120-144.</ref> to robustly increase the obstacle and path detection range to over 100 meters. This approach was implemented and tested on the Learning Applied to Ground Robots (LAGR) platform provided by the National Robotics Engineering Center (NREC).<br />
<br />
= Related Work =<br />
<br />
A common approach to vision-based driving is to process images captured from a pair of stereo cameras, produce a point cloud and use various heuristics to build a traversability map <ref name="goldberg2002">Goldberg, Steven B., Mark W. Maimone, and Lany Matthies. "Stereo vision and rover navigation software for planetary exploration." Aerospace Conference Proceedings, 2002. IEEE. Vol. 5. IEEE, 2002.</ref> <ref name="kriegman1989">Kriegman, David J., Ernst Triendl, and Thomas O. Binford. "Stereo vision and navigation in buildings for mobile robots." Robotics and Automation, IEEE Transactions on 5.6 (1989): 792-803.</ref> <ref name="kelly1998">Kelly, Alonzo, and Anthony Stentz. "Stereo vision enhancements for low-cost outdoor autonomous vehicles." Int’l Conf. on Robotics and Automation, Workshop WS-7, Navigation of Outdoor Autonomous Vehicles. Vol. 1. 1998.</ref><br />
. There have been efforts to increase the range of stereo vision by using the color of nearby ground and obstacles, but these color-based improvements can easily be fooled by shadows, monochromatic terrain, and complex obstacle or ground types.<br />
<br />
More recent vision-based approaches such as <ref name="hong2002">Hong, Tsai Hong, et al. "Road detection and tracking for autonomous mobile robots." AeroSense 2002 (2002): 311-319.</ref> <ref name="lieb2005">Lieb, David, Andrew Lookingbill, and Sebastian Thrun. "Adaptive Road Following using Self-Supervised Learning and Reverse Optical Flow." Robotics: Science and Systems. 2005.</ref> <ref name="dahlkamp2006">Dahlkamp, Hendrik, et al. "Self-supervised Monocular Road Detection in Desert Terrain." Robotics: science and systems. 2006.</ref> use learning algorithms to map traversability information to color histograms or geometric (point cloud) data, and have achieved success in the DARPA challenge.<br />
<br />
Other, non-vision-based systems have used the near-to-far learning paradigm to classify distant sensor data based on self-supervision from a reliable, close-range sensor. A self-supervised classifier was trained on satellite imagery and ladar sensor data for the Spinner vehicle’s navigation system<ref><br />
Sofman, Boris, et al. "Improving robot navigation through self‐supervised online learning." Journal of Field Robotics 23.11‐12 (2006): 1059-1075.<br />
</ref><br />
and an online self-supervised classifier for a ladar-based navigation system was trained to predict load-bearing surfaces in the presence of vegetation.<ref><br />
Wellington, Carl, and Anthony Stentz. "Online adaptive rough-terrain navigation vegetation." Robotics and Automation, 2004. Proceedings. ICRA'04. 2004 IEEE International Conference on. Vol. 1. IEEE, 2004.<br />
</ref><br />
<br />
= Challenges =<br />
<br />
* <span>'''Choice of Feature Representation''': The system must choose a feature representation that is informative enough to discriminate terrain types while remaining robust to irrelevant transformations.</span><br />
* <span>'''Automatic generation of Training labels''': Because the classifier devised is trained in real-time, it requires a constant stream of training data and labels to learn from.</span><br />
* <span>'''Ability to generalize from near to far field''': Objects captured by the camera scale inversely with their distance from the camera, so the system needs to take this into account and normalize the detected objects.</span><br />
<br />
= Overview of the Learning Process =<br />
<br />
[[Image:method.png|frame| center | 400px | alt=|Learning System Proposed by <ref name="hadsell2009" /> ]]<br />
<br />
The learning process described by <ref name="hadsell2009" /> is as follows:<br />
<br />
# <span>'''Pre-Processing and Normalization''': This step involves correcting the skewed horizon captured by the camera and normalizing the scale of objects captured by the camera, since objects captured scales inversely proportional to distance away from camera.</span><br />
# <span>'''Feature Extraction''': Convolutional Neural Networks were trained and used to extract features in order to reduce dimensionality.</span><br />
# <span>'''Stereo Supervisor Module''': Complicated procedure that uses multiple ground plane estimation, heuristics and statistical false obstacle filtering to generate class labels to close range objects in the normalized input. The goal is to generate training data for the classifier at the end of this learning process.</span><br />
# <span>'''Training and Classification''': Once the class labels and feature extraction training data is combined, it is fed into the classifier for real-time training. The classifier is trained on every frame and the authors have used stochastic gradient descent to update the classifier weights and cross entropy as the loss function.</span><br />
<br />
== Pre-Processing and Normalization ==<br />
<br />
At the first stage of the learning process there are two issues that need addressing: first, the skewed horizon due to the roll of the camera and terrain; second, the true scale of objects in the input image. Since objects scale inversely with distance from the camera, they need to be normalized to represent their true scale.<br />
<br />
[[Image:horizon_pyramid.png|frame| center | 400px | alt=|Horizon Pyramid <ref name="hadsell2009" /> <span data-label="fig:hpyramid"></span>]]<br />
<br />
To solve both issues, a normalized “pyramid” containing 7 sub-images is extracted (see figure above), where the top row of the pyramid covers a range from 112 meters to infinity and the closest pyramid row covers a range of 4 to 11 meters. These pyramid sub-images are extracted and normalized from the input image to form the input for the next stage.<br />
<br />
[[Image:horizon_normalize.png|frame| center | 400px | alt=|Creating target sub-image <ref name="hadsell2009" /> <span data-label="fig:hnorm"></span>]]<br />
<br />
To obtain the scaled and horizon-corrected sub-images, the authors used a combination of a Hough transform and a robust PCA refit to estimate the ground plane <math>P = (p_{r}, p_{c}, p_{d}, p_{o})</math>, where <math>p_{r}</math> is the roll, <math>p_{c}</math> is the column, <math>p_{d}</math> is the disparity and <math>p_{o}</math> is the offset. Once the ground plane <math>P</math> is estimated, the horizon target sub-image <math>A, B, C, D</math> (see figure above) is computed from the horizon line <math>\overline{EF}</math> at a stereo disparity of <math>d</math> pixels. The following equations give the midpoint <math>M</math> of the line, the endpoints <math>E</math> and <math>F</math>, the rotation <math>\theta</math>, and finally the corner points <math>A, B, C, D</math>.<br />
<br />
<math>\textbf{M}_{y} = \frac{p_{c} \textbf{M}_{x} + p_{d} d + p_{o}}{-p_{r}}</math><br />
<br />
<math>E = \left( \textbf{M}_{x} - \textbf{M}_{x} \cos{\theta},\ \textbf{M}_{y} - \textbf{M}_{y} \sin{\theta} \right)</math><br />
<br />
<math>F = \left( \textbf{M}_{x} + \textbf{M}_{x} \cos{\theta},\ \textbf{M}_{y} + \textbf{M}_{y} \sin{\theta} \right)</math><br />
<br />
<math>\theta = \left( <br />
\frac{w p_{c} + p_{d} d + p_{o}}{-p_{r}}<br />
- \frac{p_{d} d + p_{o}}{-p_{r}}<br />
\right) / w</math><br />
<br />
<math>A = \left( \textbf{E}_{x} + \alpha \sin \theta,\ \textbf{E}_{y} - \alpha \cos \theta \right)</math><br />
<br />
<math>B = \left( \textbf{F}_{x} + \alpha \sin \theta,\ \textbf{F}_{y} - \alpha \cos \theta \right)</math><br />
<br />
<math>C = \left( \textbf{F}_{x} - \alpha \sin \theta,\ \textbf{F}_{y} + \alpha \cos \theta \right)</math><br />
<br />
<math>D = \left( \textbf{E}_{x} - \alpha \sin \theta,\ \textbf{E}_{y} + \alpha \cos \theta \right)</math><br />
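Assuming the equations above, the corner-point computation can be sketched as follows; <code>target_corners</code>, its argument names, and the test values are illustrative rather than taken from the paper:<br />

```python
import numpy as np

def target_corners(M, theta, alpha):
    """Corner points A, B, C, D of the horizon-aligned target sub-image:
    E and F are the endpoints of the horizon line through its midpoint M,
    and the corners offset E and F perpendicular to the line by alpha."""
    Mx, My = M
    E = np.array([Mx - Mx * np.cos(theta), My - My * np.sin(theta)])
    F = np.array([Mx + Mx * np.cos(theta), My + My * np.sin(theta)])
    n = np.array([np.sin(theta), -np.cos(theta)])   # unit normal to the line EF
    A, B = E + alpha * n, F + alpha * n
    C, D = F - alpha * n, E - alpha * n
    return A, B, C, D

# With zero rotation, the sub-image is an axis-aligned box of half-height alpha:
A, B, C, D = target_corners(M=(10.0, 5.0), theta=0.0, alpha=2.0)
print(A, D)    # [0. 3.] [0. 7.]
```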
<br />
The last step of this stage converts the images from RGB to YUV, a conversion common in image-processing pipelines.<br />
<br />
== Feature Extraction ==<br />
<br />
The goal of feature extraction is to reduce the input dimensionality and increase the generality of the resulting classifier. Instead of using a hand-tuned feature list, <ref name="hadsell2009" /> used a data-driven approach and trained 4 different feature extractors; this is the only component of the learning process that is trained off-line.<br />
<br />
<ul><br />
<li><p>'''Radial Basis Functions (RBF)''': A set of RBFs was learned to form a feature vector by calculating the Euclidean distance between the input window and each of the 100 RBF centers, where each component <math>D_j</math> of the feature vector has the form:</p><br />
<p><math>D_{j} = exp(-\beta^{j} || X - K^{j} ||^{2}_{2})</math></p><br />
<p>where <math>\beta^{j}</math> is the inverse variance of the RBF center <math>K^{j}</math>, <math>X</math> is the input window, and <math>K = \{K^{i} | i = 1 \dots n\}</math> is the set of <math>n</math> radial basis centers.</p></li><br />
<li><p>'''Convolution Neural Network (CNN)''': A standard CNN was used. The architecture consisted of two layers: the first has 20 7x6 filters and the second has 369 6x5 filters. During training, a fully connected layer of 100 hidden neurons is added as a final layer to train with 5 outputs; once the network is trained, that layer is removed, so the resulting CNN outputs a 100-component feature vector. For training, the authors randomly initialized the weights, used stochastic gradient descent for 30 epochs, and applied <math>L^2</math> regularization. The network was trained on 450,000 labeled image patches and tested on 50,000 labeled patches.</p></li><br />
<li><p>'''Supervised and Unsupervised Auto-Encoders''': Auto-encoders and Deep Belief Networks <ref name="hinton2006">Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief nets." Neural computation 18.7 (2006): 1527-1554.</ref> <ref name="ranzato2007">Ranzato, Marc Aurelio, et al. "Unsupervised learning of invariant feature hierarchies with applications to object recognition." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007.</ref> are trained with a layer-wise procedure. The deep belief net trained here has 3 layers, where the first and third are convolutional layers and the second is a max-pooling layer; the architecture is shown in the figure below. Since the encoder contains a max-pooling layer, the decoder should have an unpooling layer, but the authors did not specify which unpooling technique they used.</p><br />
[[Image:convo_arch.png|frame| center | 400px | alt=|Convolution Neural Network <ref name="hadsell2009" /> <span data-label="fig:convoarch"></span>]]<br />
<br />
<p>For training, the loss function is the mean squared error between the original input and the decoded image. First the network is trained on 10,000 unlabeled images (unsupervised training) with varying outdoor settings (150 settings); then the network is fine-tuned on a labeled dataset (supervised training). The authors did not mention how large the labeled dataset was or what training parameters were used for the supervised stage.</p></li></ul><br />
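The RBF feature extractor described in the first bullet above can be sketched as follows; the window size, the random centers, and the inverse variances are made-up placeholders:<br />

```python
import numpy as np

def rbf_features(X, centers, betas):
    """D_j = exp(-beta_j * ||X - K_j||^2): squared Euclidean distance of the
    input window to each radial basis center, scaled by that center's
    inverse variance."""
    d2 = ((centers - X) ** 2).sum(axis=1)   # squared distances to all centers
    return np.exp(-betas * d2)

rng = np.random.default_rng(0)
K = rng.normal(0, 1, (100, 24 * 24))        # 100 centers; window size is assumed
betas = np.full(100, 0.01)                  # inverse variances (assumed values)
X = rng.normal(0, 1, 24 * 24)
D = rbf_features(X, K, betas)
print(D.shape)                              # (100,)
```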
<br />
== Stereo Supervisor Module ==<br />
<br />
[[Image:ground_plane_estimation.png|frame| center | 400px | alt=|Ground Plane Estimation <ref name="hadsell2009" /> <span data-label="fig:gplanes"></span>]]<br />
<br />
Once the images have been preprocessed and normalized, stereo vision algorithms are used to produce data samples and labels that are “visually consistent, error free and well distributed”. There are 4 steps at this stage:<br />
<br />
<ol><br />
<li><p><span>'''3D point cloud''': Step one, a 3D point cloud is produced by using the Triclops stereo vision algorithm from Point Grey Research. The algorithm used has a range of 12 to 15 meters and works by triangulating objects between two images to find the depth.</span></p></li><br />
<li><p>'''Estimation of ground plane''': Secondly a ground plane model is found by using a combination of Hough transform and principle component analysis (PCA) to fit a plane onto the point cloud <math>S = \{ (x^{i}, y^{i}, z^{i}) | i = 1 \dots n) \} </math>. Where <math>x^{i}, y^{i}, z^{i}</math> defines the position of point relative to the robot’s center, and <math>n</math> is the number of points in the point cloud.</p><br />
<p>The rationale behind using the Hough transform is that multiple candidate ground planes can be found (see figure above), so a voting system was introduced whereby the parameter vector denoting the ground plane (pitch, roll and offset) that has the most votes is used. It is selected by the following equation:</p><br />
<p><math>X = P_{ijk}, \quad (i, j, k) = \underset{i,j,k}{argmax} (V_{ijk})</math></p><br />
<p>Where <math>X</math> is the new plane estimate, <math>V</math> is a tensor that accumulates the votes and <math>P</math> is a tensor that records the plane parameter space. Then PCA is used to refit and compute the eigenvalue decomposition of the covariance matrix of the points <math>X^{1 \dots n}</math>.</p><br />
<p><math>\frac{1}{n} \sum^{n}_{i=1} X^{i} X^{i'} = Q \Lambda Q'</math></p><br />
<p>It should be noted, however, that estimating multiple ground planes does not eliminate all errors from the labeling process; the authors additionally applied heuristics to minimize the errors in the training data <ref name="hadsell2009" />.<br />
<br />
<li><span>'''Projection''': Stereo vision has the limitation of only being able to robustly detect short range (12m max) objects. In an attempt to mitigate the uncertainty of long range objects, footlines of obstacles (the bottom outline of the obstacle) are used. This gives stereo vision better estimates about the scale and distance of long range objects. The footline of long range objects are found by projecting obstacle points onto the ground planes and marking high point-density regions.</span></p></li><br />
<li><p>'''Labeling''': Once the ground plane estimation, footline projections and obstacle points are found, ground map <math>G</math>, footline-map <math>F</math> and obstacle-map <math>O</math> can be produced.</p><br />
<p>Conventionally, binary classifiers are used for terrain traversability; however, <ref name="hadsell2009" /> used a classifier with 5 labels:</p><br />
<ul><br />
<li><p><span>Super-traversable</span></p></li><br />
<li><p><span>Ground</span></p></li><br />
<li><p><span>Footline</span></p></li><br />
<li><p><span>Obstacle</span></p></li><br />
<li><p><span>Super-obstacle</span></p></li></ul><br />
<br />
[[Image:label_categories.png|frame| center | 400px | alt=|Label Categories <ref name="hadsell2009" /> <span data-label="fig:labelcategories"></span>]]<br />
<br />
<p>Super-traversable and super-obstacle are high-confidence labels that refer to input windows where only ground or only obstacles are seen. Lower-confidence labels such as ground and obstacle are used when there is a mixture of points in the input window. Lastly, footline labels are assigned when footline points are centered in the middle of the input window. The label criteria rules used by <ref name="hadsell2009" /> are outlined in the figure below.</p><br />
[[Image:label_criteria.png|frame| center | 400px | alt=|Label Criteria Rules <ref name="hadsell2009" /> <span data-label="fig:labelcriteria"></span>]]<br />
</li></ol><br />
<br />
== Training and Classification ==<br />
<br />
The real-time classifier is the last stage of the learning process. Due to its real-time nature the classifier has to be simple and efficient, so 5 logistic regression classifiers (one per category) were used, trained with a Kullback–Leibler divergence (relative entropy) loss function and stochastic gradient descent. Additionally, 5 ring (circular) buffers are used to store incoming data from the feature extractor and stereo supervisor. Each ring buffer acts as a First-In-First-Out (FIFO) queue, storing data temporarily as it is received and processed. The classifier outputs a 5-component likelihood vector for each input.<br />
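A minimal sketch of this stage, modeling the five per-category classifiers as a single multinomial logistic layer trained by SGD on a cross-entropy loss, with <code>deque</code>-based FIFO buffers; the sizes, buffer length, and learning rate are assumptions, not values from the paper:<br />

```python
import numpy as np
from collections import deque

N_CLASSES, N_FEATURES, BUFFER_SIZE = 5, 100, 1000   # buffer size is an assumption

# One FIFO ring buffer per label category, as in the training pipeline.
buffers = [deque(maxlen=BUFFER_SIZE) for _ in range(N_CLASSES)]
W = np.zeros((N_CLASSES, N_FEATURES))
b = np.zeros(N_CLASSES)

def predict(x):
    """Return a 5-component likelihood vector (softmax over per-class logits)."""
    o = W @ x + b
    e = np.exp(o - o.max())
    return e / e.sum()

def sgd_step(x, label, lr=0.01):
    """One stochastic-gradient update of the cross-entropy loss on one sample."""
    global W, b
    grad = predict(x) - np.eye(N_CLASSES)[label]   # dE/d(logits) for cross-entropy
    W -= lr * np.outer(grad, x)
    b -= lr * grad

rng = np.random.default_rng(0)
x, label = rng.normal(0, 1, N_FEATURES), 2
buffers[label].append(x)        # new samples displace the oldest (FIFO)
before = -np.log(predict(x)[label])
sgd_step(x, label)
after = -np.log(predict(x)[label])
print(bool(after < before))     # True: one update lowered the loss on this sample
```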
<br />
= Experimental Results =<br />
<br />
== Performances of Feature Extractors ==<br />
<br />
[[Image:feature_extractors.png|frame| center | 400px | alt=|Comparision of Feature Extractors <ref name="hadsell2009" /> <span data-label="fig:featureextractors"></span>]]<br />
<br />
For testing the feature extractors, a dataset containing 160 hand-labeled frames from over 25 log files was used. The log files can be divided into 7 groups, as seen in the figure above, which compares the 4 different feature extractors: Radial Basis Functions, a Convolutional Neural Network, an unsupervised auto-encoder, and a supervised auto-encoder. In almost all cases the best feature extractor was the CNN trained with auto-encoders, with the best average error rate of <math>8.46\%</math>.<br />
<br />
== Performances of Stereo Supervisor Module ==<br />
<br />
[[Image:stereo_module_comparison.png|frame| center | 400px | alt=|Stereo Module Performance <ref name="hadsell2009" /> <span data-label="fig:stereomodulecomparison"></span>]]<br />
<br />
To test the stereo module it was compared against the online classifier using the same ground-truth dataset as in the previous section. As the figure above shows, the online classifier performs better than the stereo supervisor module; the authors attribute this to the online classifier's ability to smooth and regularize the noisy data <ref name="hadsell2009" />.<br />
<br />
== Field Test ==<br />
<br />
The online classifier was deployed onto a Learning Applied to Ground Robots (LAGR) vehicle provided by the National Robotics Engineering Center (NREC) and tested on three different courses. The system ran 2 processes simultaneously: a 1-2 Hz online classifier as outlined above, and a fast 8-10 Hz stereo-based obstacle avoidance module. The combination of both provides good long-range and short-range obstacle avoidance capabilities.<br />
<br />
The system was found to be most effective when the long-range online classifier was combined with the short-range module; since the short-range module alone has a range of only about 5 meters, it often required human intervention to rescue the vehicle. No quantitative comparisons were given for these field tests; the evaluation is purely subjective and was conducted only during daytime.<br />
<br />
= Conclusion =<br />
<br />
This paper did not introduce novel deep learning methods per se; however, the application of deep learning methods (CNN + auto-encoders) along with a stereo module to train a 5-label classifier shows great promise, increasing the road classification range from a maximum of 10-12 meters with purely stereo vision to over 100 meters, which was new in 2009 <ref name="hadsell2009" />.<br />
<br />
There were several issues with the experiments I have observed:<br />
<br />
* <span>There was no mention of how many times the feature extractors were trained to obtain the best parameters, nor of the difficulty of training.</span><br />
* <span>All data and tests were performed during daytime, no mention of limitations at night.</span><br />
* <span>This paper did not compare itself against other state-of-the-art systems such as <ref name="hong2002" /> <ref name="lieb2005" /> <ref name="dahlkamp2006" />, only against stereo-vision-based systems.</span><br />
* <span>The plot of stereo vision vs. the online classifier did not contain error bars. Also, on the x-axis the ground-truth frames are ordered by error difference; it would be interesting to see a time ordering instead, and whether it would show that stereo vision performs well at the beginning but poorly afterwards, supporting the authors' claim that an online classifier is able to smooth and regularize the noisy data.</span><br />
* <span>Field tests lacked quantitative measures to compare the long-range system against the short-range system.</span><br />
<br />
= References =<br />
<references /></div>

continuous space language models
<hr />
<div>= Introduction =<br />
This paper describes the use of a neural network language model for large-vocabulary continuous speech recognition.<br />
The underlying idea of this approach is to attack the data sparseness problem by performing the language model probability<br />
estimation in a continuous space. Highly efficient learning algorithms are described that enable the use of training<br />
corpora of several hundred million words. It is also shown that this approach can be incorporated into a large-vocabulary<br />
continuous speech recognizer using a lattice rescoring framework with very low additional processing time.<br />
<br />
<br />
In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <math>\,e</math>, it is common to model these problems as finding the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. This can be written as:<br />
<br />
<math>w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w)</math><br />
<br />
An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.<br />
<br />
This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate <math>\,P(w)</math>.<br />
<br />
= Back-off n-grams Model =<br />
<br />
A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:<br />
<br />
<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math><br />
<br />
It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:<br />
<br />
<math>\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrence of the sequence} (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence} (w_1,\dots,w_{i-1})}</math><br />
<br />
However, it is practically impossible to have a training set large enough to contain every possible sequence of words if the sequence is long enough and some sequences would have an incorrect probability of 0 simply because it is not in the training set. This is known as the data sparseness problem. This problem is commonly resolved by considering only the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.<br />
<br />
To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:<br />
<br />
<math>\,P(w_i|w^{i-1}_1) = \begin{cases} <br />
\frac{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if number of occurrence of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\<br />
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise} <br />
\end{cases}</math><br />
<br />
<math>\,\alpha</math> is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.<br />
<br />
The general algorithm is then, if the data set does contain the sequence then calculate probability directly. Otherwise, apply a discounting factor and calculate the conditional probability with the first word in the sequence removed. For example, if the word sequence was "The dog barked" and it did not exist in the training set then the formula would be written as:<br />
<br />
<math>\,P(\mbox{barked}|\mbox{the,dog}) \approx \alpha P(\mbox{barked}|\mbox{dog})</math><br />
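The back-off scheme above can be sketched as follows; the cutoff <code>K</code> and discount <code>alpha</code> are illustrative constants, whereas (as noted above) real systems make the discount depend on the word sequence:<br />

```python
from collections import Counter

corpus = "the dog ran the dog ran the cat sat".split()

def ngram_counts(tokens, n_max):
    """Count every n-gram of length 1..n_max in the token stream."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_prob(seq, counts, K=0, alpha=0.4):
    """P(last word | preceding words): relative frequency if the full sequence
    occurred more than K times, otherwise discount by alpha and drop the
    earliest context word."""
    seq = tuple(seq)
    if len(seq) == 1:
        total = sum(c for s, c in counts.items() if len(s) == 1)
        return counts[seq] / total
    if counts[seq] > K:
        return counts[seq] / counts[seq[:-1]]
    return alpha * backoff_prob(seq[1:], counts, K, alpha)

counts = ngram_counts(corpus, 3)
print(backoff_prob(["the", "dog", "ran"], counts))   # seen directly: 2/2 = 1.0
print(backoff_prob(["dog", "sat"], counts))          # backs off: 0.4 * 1/9 ≈ 0.0444
```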
<br />
= Model =<br />
The neural network language model has to perform two tasks: first, project all words of the context<br />
<math>\,h_j</math> = <math>\,w_{j-n+1}^{j-1}</math> onto a continuous space, and second, calculate the language model probability <math>P(w_{j}=i|h_{j})</math>. <br />
The researchers for this paper sought a better model for this probability than the back-off n-grams model. Their approach was to map the n-1 word sequence onto a multi-dimensional continuous space using one neural network layer, followed by another layer that estimates the probabilities of all possible next words. The formulas and model are as follows:<br />
<br />
For some sequence of n-1 words, encode each word using 1 of K encoding, i.e. 1 where the word is indexed and zero everywhere else. Label each 1 of K encoding by <math>(w_{j-n+1},\dots,w_j)</math> for some n-1 word sequence at the j'th word in some larger context.<br />
<br />
Let P be a projection matrix common to all n-1 words and let<br />
<br />
<math>\,a_i=Pw_{j-n+i},i=1,\dots,n-1</math><br />
<br />
Let H be the weight matrix from the projection layer to the hidden layer; the state of the hidden layer is then:<br />
<math>\,h=\tanh(Ha + b)</math> where <math>\,a</math> is the concatenation of all the <math>\,a_i</math> and <math>\,b</math> is a bias vector<br />
<br />
Finally, the output vector would be:<br />
<br />
<math>\,o=Vh+k</math> where V is the weight matrix from the hidden to the output layer and k is another bias vector. <math>\,o</math> is a vector with one entry per word in the vocabulary, and the probabilities are obtained from <math>\,o</math> by applying the softmax function.<br />
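The projection–hidden–output pipeline above can be sketched in NumPy with toy dimensions. All sizes and weight initializations here are illustrative assumptions, not the paper's settings.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H = 10, 4, 5          # vocabulary size, projection size, hidden size (toy)
n = 3                        # trigram model: predict from the n-1 = 2 previous words

P = rng.normal(0, 0.1, (D, V))             # shared projection matrix
Hw = rng.normal(0, 0.1, (H, (n - 1) * D))  # projection -> hidden weights
b = np.zeros(H)
Vw = rng.normal(0, 0.1, (V, H))            # hidden -> output weights
k = np.zeros(V)

def nnlm_probs(context):
    """P(w_j = i | h_j) for every word i, given n-1 context word indices."""
    # Projecting a one-hot input just selects a column of P.
    a = np.concatenate([P[:, w] for w in context])
    h = np.tanh(Hw @ a + b)
    o = Vw @ h + k
    e = np.exp(o - o.max())   # softmax over the whole vocabulary
    return e / e.sum()

p = nnlm_probs([3, 7])
print(p.shape, round(p.sum(), 6))   # (10,) 1.0
```

One forward pass yields the full distribution over the vocabulary at once, which is what the lattice-rescoring and bunching optimizations below exploit.<br />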
<br />
The following figure shows the architecture of the neural network language model. <math>\,h_j</math> denotes the context <math>\,w_{j-n+1}^{j-1}</math>, P is the size of one projection, and H and N are the sizes of the hidden and output layers, respectively. When short-lists are used, the size of the output layer is much smaller than the size<br />
of the vocabulary.<br />
<br />
[[File:Q3.png]]<br />
<br />
In contrast to standard language modeling where we want to know the probability of a word i given its<br />
context, <math>P(w_{j} = i|h_{j}) </math>, the neural network simultaneously predicts the language model probability of all words<br />
in the word list:<br />
<br />
[[File:Q4.png]]<br />
<br />
= Optimization and Training =<br />
The training was done with standard back-propagation, minimizing the error function:<br />
<br />
<math>\,E=-\sum_{i=1}^N t_i \log p_i + \epsilon\left(\sum_{i,j}h^2_{ij}+\sum_{i,j}v^2_{ij}\right)</math><br />
<br />
<math>\,t_i</math> is the i'th component of the desired output vector, and the summations inside the epsilon bracket are regularization terms that prevent overfitting of <math>\,H</math> and <math>\,V</math>.<br />
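The error function, a cross-entropy term (written here with the conventional minus sign on the log-likelihood) plus weight decay on H and V, can be computed as follows; the toy vectors are illustrative.<br />

```python
import numpy as np

def nnlm_loss(p, t, Hw, Vw, eps=1e-5):
    """Cross-entropy error with L2 weight decay on the H and V matrices.

    p: predicted probabilities, t: one-hot target vector.
    """
    return -np.sum(t * np.log(p)) + eps * (np.sum(Hw**2) + np.sum(Vw**2))

p = np.array([0.7, 0.2, 0.1])
t = np.array([1.0, 0.0, 0.0])
Hw = np.zeros((2, 2)); Vw = np.zeros((2, 2))
print(nnlm_loss(p, t, Hw, Vw))   # -log(0.7) ≈ 0.3567, since the decay term is zero here
```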
<br />
The researchers used stochastic gradient descent, which avoids summing the error over millions of examples per update and thus speeds up training.<br />
<br />
An issue the researchers ran into with this model was that calculating language model probabilities took much longer than with the traditional back-off n-grams model, which reduced its suitability for real-time prediction. To solve this issue, several optimization techniques were used.<br />
<br />
===Lattice rescoring===<br />
<br />
It is common to keep track of additional possible solutions, instead of just the single most likely one, in a lattice structure, i.e. a tree-like structure where branches can merge and each branch represents a possible solution. For example, using a tri-gram model from the paper (i.e. predicting the third word from the first two), the following lattice structure was formed:<br />
<br />
[[File:Lattice.PNG]]<br />
<br />
Any particular branch where two nodes have the same words can be merged. For example, "a,problem" was merged in the middle of the lattice because the tri-gram model would estimate the same probability at that point for both branches. Conversely, "that_is,not" and "there_is,not" cannot be merged, because the preceding two words used for prediction are different.<br />
<br />
After this structure is created with a traditional back-off n-grams model, the neural network is then used to re-score the lattice and the re-scored lattice is used to make predictions.<br />
<br />
===Short List===<br />
<br />
In any language, there is usually a small set of commonly used words that form almost all of written or spoken thought. The short-list idea is that rather than calculating every single probability for even the rarest words, the neural network only calculates a small subset of the most common words. This way, the output vector can be significantly shrunk from <math>\,\mbox{N}</math> to some much smaller number <math>\,\mbox{S}</math>.<br />
<br />
If any rare words do occur, their probabilities are calculated using the traditional back-off n-grams model. The formula then goes as follows from the paper:<br />
<br />
[[File:shortlist.PNG]]<br />
<br />
Where L is the event that <math>\,w_t</math> is in the short-list.<br />
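The short-list combination can be sketched as follows, with hypothetical NN and back-off distributions. The NN probabilities over the short-list are rescaled by <math>\,P_s(h)</math>, the back-off probability mass of the short-list in this context, so that the combined result still sums to one over the whole vocabulary.<br />

```python
def shortlist_prob(word, p_nn, p_backoff, shortlist):
    """Combine NN and back-off probabilities with a short-list.

    p_nn: dict of NN probabilities over the short-list words,
    p_backoff: dict of back-off probabilities over the full vocabulary.
    """
    p_s = sum(p_backoff[w] for w in shortlist)   # back-off mass of the short-list
    if word in shortlist:
        return p_nn[word] * p_s
    return p_backoff[word]

shortlist = {"the", "dog"}
p_nn = {"the": 0.6, "dog": 0.4}                      # sums to 1 over the short-list
p_backoff = {"the": 0.5, "dog": 0.3, "barked": 0.2}  # sums to 1 over the vocabulary

total = sum(shortlist_prob(w, p_nn, p_backoff, shortlist) for w in p_backoff)
print(total)   # ~1.0: still a proper distribution
```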
<br />
===Sorting and Bunch===<br />
<br />
The neural network predicts all the probabilities based on some sequence of words. Suppose the probabilities of two different sequences are required, where sequence 1 is <math>\,w=(w_1,\dots,w_{i-1},w_i)</math> and sequence 2 is <math>\,w'=(w_1,\dots,w_{i-1},w'_i)</math>; they differ only in the last word. Then only a single feed through the neural network is required, because the output vector for the context <math>\,(w_1,\dots,w_{i-1})</math> predicts the probabilities of both <math>\,w_i</math> and <math>\,w'_i</math> being next. It is therefore efficient to merge any sequences that share the same context.<br />
<br />
Modern computers are also highly optimized for linear algebra, and it is more efficient to run multiple examples through the matrix equations at the same time. The researchers called this bunching, and simple testing showed that it decreased processing time by a factor of 10 when using 128 examples at once compared to 1.<br />
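Bunching amounts to replacing matrix–vector products with matrix–matrix products; a minimal NumPy sketch with illustrative sizes:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
D, H, V, B = 8, 16, 50, 128     # projection, hidden, vocab sizes; bunch size

Hw = rng.normal(0, 0.1, (H, D))
Vw = rng.normal(0, 0.1, (V, H))

A = rng.normal(0, 1.0, (D, B))  # 128 projected contexts stacked as columns

# One matrix-matrix product per layer handles the whole bunch at once,
# instead of 128 separate matrix-vector products.
Hid = np.tanh(Hw @ A)
O = Vw @ Hid
Probs = np.exp(O - O.max(axis=0))
Probs /= Probs.sum(axis=0)

print(Probs.shape)              # (50, 128): one distribution per example
```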
<br />
= Training and Usage =<br />
<br />
The researchers used numerous optimization techniques during training and their results were summarized in the paper as follows:<br />
<br />
[[File:fast_training.PNG]]<br />
<br />
Since the model only predicts based on the last n-1 words, at certain points there will be fewer than n-1 words of context and adjustments must be made. The researchers considered two possibilities: using traditional models for these short n-grams, or padding the context with a filler word up to n-1 words. After some testing, they found that requests for short n-gram probabilities were infrequent, and they decided to use the traditional back-off n-gram model for these cases.<br />
<br />
= Results =<br />
<br />
In general the results were quite good. When this neural network + back-off n-grams hybrid was used in combination with a number of acoustic speech recognition models, perplexity (lower is better) decreased by about 10% in a number of cases compared with the traditional back-off-only model. Some of the results are summarized as follows:<br />
<br />
[[File:results1.PNG]]<br />
<br />
[[File:results2.PNG]]<br />
<br />
The following figure shows the word error rates on the 2003 evaluation test set for the back-off LM and the hybrid LM, trained only on CTS data (left bars for<br />
each system) and interpolated with the broadcast news LM (right bars for each system).<br />
<br />
[[File:Q6.png]]<br />
<br />
A perplexity reduction of about 9% relative is obtained independently of the size of the language model<br />
training data. This gain decreases to approximately 6% after interpolation with the back-off language model<br />
trained on the additional BN corpus of out-of domain data. It can be seen that the perplexity of the hybrid<br />
language model trained only on the CTS data is better than that of the back-off reference language model<br />
trained on all of the data (45.5 with respect to 47.5). Despite these rather small gains in perplexity, consistent<br />
word error reductions were observed.<br />
<br />
= Conclusion =<br />
<br />
This paper described the theory and an experimental evaluation of a new approach to language modeling for large-vocabulary continuous speech recognition, based on the idea of projecting the words onto a continuous space and performing the probability estimation in that space. The method is fast enough that the neural network language model can be used in a real-time speech recognizer. The necessary capacity of the neural network is an important issue. Three possibilities were explored: increasing the size of the hidden layer, training several networks and interpolating them together, and using large projection layers. Increasing the size of the hidden layer gave only modest improvements in word error,<br />
at the price of very long training times. In this respect, the second solution is more interesting as the networks<br />
can be trained in parallel. Large projection layers appear to be the best choice as this has little impact on the<br />
complexity during training or recognition. The neural network language model is able to cover different speaking styles, ranging from rather well-formed speech with few errors (broadcast news) to very relaxed speaking with many errors in syntax and semantics (meetings and conversations). It is claimed that the combination of the developed neural network and a back-off language model can be considered a serious alternative to the commonly used back-off language models alone.<br />
<br />
This paper also proposes to investigate new training criteria for the neural network language model. Language<br />
models are almost exclusively trained independently from the acoustic model by minimizing the perplexity<br />
on some development data, and it is well known that improvements in perplexity do not necessarily<br />
lead to reductions in the word error rate.<br />
<br />
The continuous representation of the words in the neural network language model offers new ways to perform<br />
constrained language model adaptation. For example, the continuous representation of the words can be<br />
changed so that the language model predictions are improved on some adaptation data, e.g., by moving some<br />
words closer together which appear often in similar contexts. The idea is to apply a transformation on the<br />
continuous representation of the words by adding an adaptation layer between the projection layer and the<br />
hidden layer. This layer is initialized with the identity transformation and then learned by training the neural<br />
network on the adaptation data. Several variants of this basic idea are possible, for example using shared<br />
block-wise transformations in order to reduce the number of free parameters.<br />
In comparison with back-off language models, whose complexity increases exponentially with the length of the context, the complexity of neural network language models increases<br />
linearly with the order of the n-gram and with the size of the vocabulary. This linear increase in parameters is an important practical advantage that<br />
enables longer-span language models with a negligible increase in memory and time complexity. <br />
<br />
<br />
The underlying idea of the continuous space language model described here is to perform the probability<br />
estimation in a continuous space. Although only neural networks were investigated in this work, the approach<br />
is not inherently limited to this type of probability estimator. Other promising candidates include Gaussian<br />
mixture models and radial basis function networks. These models are interesting since they can be more easily<br />
trained on large amounts of data than neural networks, and the limitation of a short-list at the output may not<br />
be necessary. The use of Gaussians makes it also possible to structure the model by sharing some Gaussians<br />
using statistical criteria or high-level knowledge. On the other hand, Gaussian mixture models are a non-discriminative<br />
approach. Comparing them with neural networks could provide additional insight into the success<br />
of the neural network language model.<br />
<br />
= Source =<br />
Schwenk, H. Continuous space language models. Computer Speech<br />
Lang. 21, 492–518 (2007).</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=extracting_and_Composing_Robust_Features_with_Denoising_Autoencoders&diff=27722extracting and Composing Robust Features with Denoising Autoencoders2017-08-30T13:46:30Z<p>Conversion script: Conversion script moved page Extracting and Composing Robust Features with Denoising Autoencoders to extracting and Composing Robust Features with Denoising Autoencoders: Converting page titles to lowercase</p>
<hr />
<div>= Introduction =<br />
This paper explores a new training principle for unsupervised learning<br />
of a representation based on the idea of making the learned representations<br />
robust to partial corruption of the input pattern. This approach can<br />
be used to train autoencoders, and these denoising autoencoders can be<br />
stacked to initialize deep architectures. The algorithm can be motivated<br />
from a manifold learning and information theoretic perspective or from a<br />
generative model perspective.<br />
The proposed system is similar to a standard autoencoder, which is trained to learn a hidden representation that allows it to reconstruct its input. The difference between the two models is that this one is trained to reconstruct the original input from a corrupted version, generated by adding random noise to the data; reconstructing from corrupted input forces it to extract useful features.<br />
== Motivation ==<br />
<br />
The approach is based on the use of an unsupervised<br />
training criterion to perform a layer-by-layer initialization. The procedure is as follows:<br />
Each layer is at first trained to produce a higher level (hidden) representation of the observed patterns,<br />
based on the representation it receives as input from the layer below, by<br />
optimizing a local unsupervised criterion. Each level produces a representation<br />
of the input pattern that is more abstract than the previous level’s, because it<br />
is obtained by composing more operations. This initialization yields a starting<br />
point, from which a global fine-tuning of the model’s parameters is then performed<br />
using another training criterion appropriate for the task at hand.<br />
<br />
This process gives better solutions than those obtained by random initialization.<br />
<br />
= The Denoising Autoencoder =<br />
<br />
A Denoising Autoencoder reconstructs<br />
a clean “repaired” input from a corrupted, partially destroyed one. This<br />
is done by first corrupting the initial input <math>x</math> to get a partially destroyed version<br />
<math>\tilde{x}</math> by means of a stochastic mapping. This means<br />
that the autoencoder must learn to compute a representation<br />
that is informative of the original input even<br />
when some of its elements are missing. This technique<br />
was inspired by the ability of humans to have an appropriate<br />
understanding of their environment even in<br />
situations where the available information is incomplete<br />
(e.g. when looking at an object that is partly<br />
occluded). In this paper the noise is added by randomly zeroing a fixed number, <math>v_d</math>, of components and leaving the rest untouched. This is similar to salt noise in images, where random white background areas appear.<br />
<br />
As shown in the figure below, the clean input <math>x</math> is mapped to some corrupted version according to a conditional distribution <math>q_D(\tilde{x}|x)</math>. The corrupted version is then mapped to an informative representation <math>y</math>, from which the autoencoder attempts to reconstruct the clean version <math>x</math>. Thus the objective function can be described as<br />
[[File:W1.png]]<br />
<br />
The objective function minimized by<br />
stochastic gradient descent becomes: <br />
[[File:W2.png]]<br />
<br />
where the loss function is the cross-entropy of the model.<br />
The denoising autoencoder is illustrated in the following figure:<br />
<br />
[[File:W3.png]]<br />
<br />
It is important to note that usually the dimensionality of the hidden layer needs to be less than the input/output layer in order to avoid the trivial solution of identity mapping, but in this case that is not a problem since randomly zeroing out numbers causes the identity map to fail. This forces the network to learn a more abstract representation of the data regardless of the relative sizes of the layers.<br />
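A minimal NumPy sketch of one denoising-autoencoder step, with illustrative sizes and sigmoid layers, showing only the corruption, the forward pass, and the loss (not the gradient update):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, v_d):
    """Zero out v_d randomly chosen components of x (the paper's masking noise)."""
    x_tilde = x.copy()
    idx = rng.choice(x.size, size=v_d, replace=False)
    x_tilde[idx] = 0.0
    return x_tilde

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy denoising autoencoder: d inputs, h hidden units.
d, h = 8, 5
W = rng.normal(0, 0.1, (h, d)); b = np.zeros(h)
W2 = rng.normal(0, 0.1, (d, h)); c = np.zeros(d)

x = rng.random(d)                 # a clean training example in [0, 1)
x_tilde = corrupt(x, v_d=3)       # partially destroyed input
y = sigmoid(W @ x_tilde + b)      # hidden representation of the corrupted input
z = sigmoid(W2 @ y + c)           # reconstruction

# Cross-entropy reconstruction loss, measured against the *clean* input.
loss = -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))
print(int((x_tilde == 0).sum()), loss > 0)   # 3 True
```

The key point is visible in the last line: the loss compares the reconstruction to the clean <math>x</math>, not to the corrupted <math>\tilde{x}</math>.<br />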
<br />
= Layer-wise Initialization and Fine Tuning =<br />
<br />
When training a stack of denoising autoencoders, the output of the k-th layer is used as<br />
input for the (k + 1)-th, and the (k + 1)-th layer is trained after the k-th has been<br />
trained. After a few layers have been trained, the parameters are used as initialization<br />
for a network optimized with respect to a supervised training criterion.<br />
This greedy layer-wise procedure has been shown to yield significantly better<br />
local minima than random initialization of deep networks,<br />
achieving better generalization on a number of tasks.<br />
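The greedy layer-wise procedure can be sketched as follows. Here `train_dae` is a hypothetical stand-in that returns random weights so the stacking logic stays runnable; a real implementation would fit each layer by minimizing the denoising reconstruction error.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def train_dae(X, h):
    """Stand-in for denoising-autoencoder training (returns random weights).
    A real version would fit (W, b) on corrupted inputs of X."""
    W = rng.normal(0, 0.1, (h, X.shape[1]))
    b = np.zeros(h)
    return W, b

def encode(X, W, b):
    return np.tanh(X @ W.T + b)

# Stack three layers: layer k is trained on the codes of layer k-1.
X = rng.random((100, 20))        # 100 examples, 20 features
layer_sizes = [16, 12, 8]
params, H = [], X
for h in layer_sizes:
    W, b = train_dae(H, h)       # unsupervised criterion, layer by layer
    params.append((W, b))
    H = encode(H, W, b)          # representation fed to the next layer

print(H.shape)                   # (100, 8)
# `params` now initializes the deep network before supervised fine-tuning.
```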
<br />
= Analysis of the Denoising Autoencoder =<br />
== Manifold Learning Perspective ==<br />
<br />
<br />
The process of mapping a corrupted example to an uncorrupted one can be<br />
visualized in Figure 2, with a low-dimensional manifold <math>\mathcal{M}</math> near which the data<br />
concentrate. We learn a stochastic operator <math>p(X|\tilde{X})</math> that maps an <math>\tilde{X}</math> to an <math>X\,</math>.<br />
<br />
<br />
[[File:q4.png]]<br />
<br />
Since the corrupted points <math>\tilde{X}</math> will likely not be on <math>\mathcal{M}</math>, the learned map <math>p(X|\tilde{X})</math> is able to determine how to transform points away from <math>\mathcal{M}</math> into points on <math>\mathcal{M}</math>.<br />
<br />
The denoising autoencoder can thus be seen as a way to define and learn a<br />
manifold. The intermediate representation <math>Y = f(X)</math> can be interpreted as a<br />
coordinate system for points on the manifold (this is most clear if we force the<br />
dimension of <math>Y</math> to be smaller than the dimension of <math>X</math>). More generally, one can<br />
think of <math>Y = f(X)</math> as a representation of <math>X</math> which is well suited to capture the<br />
main variations in the data, i.e., on the manifold. When additional criteria (such<br />
as sparsity) are introduced in the learning model, one can no longer directly view<br />
<math>Y = f(X)</math> as an explicit low-dimensional coordinate system for points on the<br />
manifold, but it retains the property of capturing the main factors of variation<br />
in the data.<br />
<br />
== Stochastic Operator Perspective ==<br />
<br />
The denoising autoencoder can also be seen as corresponding to a semi-parametric model that can be sampled from. Define the joint distribution as follows: <br />
<br />
:<math>p(X, \tilde{X}) = p(\tilde{X}) p(X|\tilde{X}) = q^0(\tilde{X}) p(X|\tilde{X}) </math> <br />
<br />
from the stochastic operator <math>p(X | \tilde{X})</math>, with <math>q^0\,</math> being the empirical distribution.<br />
<br />
Using the Kullback-Leibler divergence, defined as:<br />
<br />
:<math>\mathbb{D}_{KL}(p|q) = \mathbb{E}_{p(X)} \left(\log\frac{p(X)}{q(X)}\right) </math><br />
<br />
then minimizing <math>\mathbb{D}_{KL}(q^0(X, \tilde{X}) | p(X, \tilde{X})) </math> yields the originally-formulated denoising criterion. Furthermore, as this objective is minimized, the marginals of <math>\,p</math> approach those of <math>\,q^0</math>, i.e. <math> p(X) \rightarrow q^0(X)</math>. Then, if <math>\,p</math> is expanded in the following way:<br />
<br />
:<math> p(X) = \frac{1}{n}\sum_{i=1}^n \sum_{\tilde{\mathbf{x}}} p(X|\tilde{X} = \tilde{\mathbf{x}}) q_{\mathcal{D}}(\tilde{\mathbf{x}} | \mathbf{x}_i) </math><br />
<br />
it becomes clear that the denoising autoencoder learns a semi-parametric model that can be sampled from (since <math>p(X)</math> above is easy to sample from). <br />
<br />
== Information Theoretic Perspective ==<br />
<br />
It is also possible to adopt an information theoretic perspective. The representation of the autoencoder should retain as much information as possible while at the same time certain properties, like a limited complexity, are imposed on the marginal distribution. This can be expressed as an optimization of <math>\arg\max_{\theta} \{I(X;Y) + \lambda \mathcal{J}(Y)\}</math> where <math>I(X; Y)</math> is the mutual information between an input sample <math>X</math> and the hidden representation <math>Y</math> and <math>\mathcal{J}</math> is a functional expressing the preference over the marginal. The hyper-parameter <math>\lambda</math> controls the trade-off between maximizing the mutual information and keeping the marginal simple.<br />
<br />
Note that this reasoning also applies to the basic autoencoder, but the denoising autoencoder maximizes the mutual information between <math>X</math> and <math>Y</math> while <math>Y</math> can also be a function of corrupted input.<br />
<br />
== Generative Model Perspective ==<br />
<br />
This section recovers the training criterion for the denoising autoencoder from a generative model perspective: the criterion of the 'Information Theoretic Perspective' section is equivalent to maximizing a variational bound on a particular generative model. The resulting training criterion is to maximize <math>\mathbb{E}_{q^0(\tilde{X})}[L(q^0, \tilde{X})]</math>, where <math>L(q^0, \tilde{X}) = \mathbb{E}_{q^0(X,Y | \tilde{X})}\left[\log\frac{p(X, \tilde{X}, Y)}{q^0(X, Y | \tilde{X})}\right]</math><br />
<br />
= Experiments =<br />
The input data contain different<br />
variations of the MNIST digit classification problem, with added factors of<br />
variation such as rotation (rot), addition of a background composed of random<br />
pixels (bg-rand) or made from patches extracted from a set of images (bg-img), or<br />
combinations of these factors (rot-bg-img). These variations render the problems particularly challenging for current generic learning algorithms. Each problem<br />
is divided into a training, validation, and test set (10000, 2000, 50000 examples<br />
respectively). A subset of the original MNIST problem is also included with the<br />
same example set sizes (problem basic). The benchmark also contains additional<br />
binary classification problems: discriminating between convex and non-convex<br />
shapes (convex), and between wide and long rectangles (rect, rect-img).<br />
Neural networks with 3 hidden layers initialized by stacking denoising autoencoders<br />
(SdA-3), and fine tuned on the classification tasks, were evaluated<br />
on all the problems in this benchmark. Model selection was conducted following<br />
a similar procedure as Larochelle et al. (2007). Several values of hyperparameters<br />
(destruction fraction <math>\,\nu</math>, layer sizes, number of unsupervised training<br />
epochs) were tried, combined with early stopping in the fine tuning phase. For<br />
each task, the best model was selected based on its classification performance<br />
on the validation set.<br />
The results can be reported in the following table.<br />
[[File:W5.png]]<br />
<br />
The filters obtained by training are shown in the figure below.<br />
<br />
<br />
[[File:Qq3.png]]<br />
<br />
= Conclusion and Future Work =<br />
<br />
The paper presents the denoising autoencoder, which was motivated by the goal of<br />
learning representations of the input that are robust to small irrelevant changes<br />
in input. Several perspectives also help to motivate it from a manifold learning<br />
perspective and from the perspective of a generative model.<br />
This principle can be used to train and stack autoencoders to initialize a<br />
deep neural network. A series of image classification experiments were performed<br />
to evaluate this new training principle. The empirical results support<br />
the following conclusions: unsupervised initialization of layers with an explicit<br />
denoising criterion helps to capture interesting structure in the input distribution.<br />
This in turn leads to intermediate representations much better suited for<br />
subsequent learning tasks such as supervised classification. The experimental<br />
results with Deep Belief Networks (whose layers are initialized as RBMs) suggest<br />
that RBMs may also encapsulate a form of robustness in the representations<br />
they learn, possibly because of their stochastic nature, which introduces noise<br />
in the representation during training.<br />
<br />
= References =<br />
<br />
Bengio, Y. (2007). Learning deep architectures for AI (Technical Report 1312).<br />
Université de Montréal, dept. IRO.<br />
<br />
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layerwise<br />
training of deep networks. Advances in Neural Information Processing<br />
Systems 19 (pp. 153–160). MIT Press.<br />
<br />
Bengio, Y., & Le Cun, Y. (2007). Scaling learning algorithms towards AI. In<br />
L. Bottou, O. Chapelle, D. DeCoste and J. Weston (Eds.), Large scale kernel<br />
machines. MIT Press.<br />
<br />
Doi, E., Balcan, D. C., & Lewicki, M. S. (2006). A theoretical analysis of<br />
robust coding over noisy overcomplete channels. In Y. Weiss, B. Schölkopf<br />
and J. Platt (Eds.), Advances in neural information processing systems 18,<br />
307–314. Cambridge, MA: MIT Press.<br />
<br />
Doi, E., & Lewicki, M. S. (2007). A theory of retinal population coding. NIPS<br />
(pp. 353–360). MIT Press.<br />
<br />
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant<br />
representations over learned dictionaries. IEEE Transactions on Image Processing,<br />
15, 3736–3745.<br />
<br />
Gallinari, P., LeCun, Y., Thiria, S., & Fogelman-Soulié, F. (1987). Mémoires<br />
associatives distribuées. Proceedings of COGNITIVA 87. Paris, La Villette.<br />
<br />
Hammond, D., & Simoncelli, E. (2007). A machine learning framework for adaptive<br />
combination of signal denoising methods. 2007 International Conference<br />
on Image Processing (pp. VI: 29–32).<br />
<br />
Hinton, G. (1989). Connectionist learning procedures. Artificial Intelligence,<br />
40, 185–234.<br />
<br />
Hinton, G., & Salakhutdinov, R. (2006). Reducing the dimensionality of data<br />
with neural networks. Science, 313, 504–507.<br />
<br />
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for<br />
deep belief nets. Neural Computation, 18, 1527–1554.<br />
<br />
Hopfield, J. (1982). Neural networks and physical systems with emergent collective<br />
computational abilities. Proceedings of the National Academy of Sciences,<br />
USA, 79.<br />
<br />
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007).<br />
An empirical evaluation of deep architectures on problems with many factors<br />
of variation. Twenty-fourth International Conference on Machine Learning<br />
(ICML’2007).<br />
<br />
LeCun, Y. (1987). Modèles connexionnistes de l'apprentissage. Doctoral dissertation,<br />
Université de Paris VI.<br />
<br />
Lee, H., Ekanadham, C., & Ng, A. (2008). Sparse deep belief net model for visual<br />
area V2. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in<br />
neural information processing systems 20. Cambridge, MA: MIT Press.<br />
<br />
McClelland, J., Rumelhart, D., & the PDP Research Group (1986). Parallel<br />
distributed processing: Explorations in the microstructure of cognition, vol. 2.<br />
Cambridge: MIT Press.<br />
<br />
Memisevic, R. (2007). Non-linear latent factor models for revealing structure<br />
in high-dimensional data. Doctoral dissertation, Departement of Computer<br />
Science, University of Toronto, Toronto, Ontario, Canada.<br />
<br />
Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2008). Sparse feature learning for<br />
deep belief networks. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.),<br />
Advances in neural information processing systems 20. Cambridge, MA: MIT<br />
Press.<br />
<br />
Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning<br />
of sparse representations with an energy-based model. Advances in Neural<br />
Information Processing Systems (NIPS 2006). MIT Press.<br />
<br />
Roth, S., & Black, M. (2005). Fields of experts: a framework for learning image<br />
priors. IEEE Conference on Computer Vision and Pattern Recognition (pp.<br />
860–867).</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=very_Deep_Convoloutional_Networks_for_Large-Scale_Image_Recognition&diff=27724very Deep Convoloutional Networks for Large-Scale Image Recognition2017-08-30T13:46:30Z<p>Conversion script: Conversion script moved page Very Deep Convoloutional Networks for Large-Scale Image Recognition to very Deep Convoloutional Networks for Large-Scale Image Recognition: Converting page titles to lowercase</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper<ref><br />
Simonyan, Karen, and Andrew Zisserman. [http://arxiv.org/pdf/1409.1556.pdf "Very deep convolutional networks for large-scale image recognition."] arXiv preprint arXiv:1409.1556 (2014).</ref> the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting is investigated. It was demonstrated that the representation depth is beneficial for the<br />
classification accuracy and the main contribution is a thorough evaluation of networks of increasing depth using a certain architecture with very small (3×3) convolution filters. Basically, they fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers. As a result, they come up with significantly more accurate ConvNet architectures.<br />
<br />
= Conv.Net Configurations =<br />
<br />
Architecture:<br />
<br />
During training, the only preprocessing step is to subtract the mean RGB value computed on the training data. Then, the image is passed through a stack of convolutional (conv.) layers with filters that have a very small receptive field: 3 × 3, with a convolutional stride of 1 pixel. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers. Max-pooling is performed over a 2 × 2 pixel window, with stride 2. A stack of convolutional layers (whose depth differs between architectures) is followed by three Fully-Connected (FC) layers. The final layer is the soft-max layer, and all hidden layers are equipped with the rectification non-linearity.<br />
<br />
They don't implement Local Response Normalization (LRN) as they found such normalization does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.<br />
<br />
Configuration:<br />
<br />
The ConvNet configurations, evaluated in this paper, are outlined in the following table:<br />
<br />
<br />
[[File:4.PNG | center]]<br />
<br />
<br />
All configurations follow the aforementioned architecture and differ only in depth: from 11 weight layers in network A (8 conv. and 3 FC layers) to 19 weight layers in network E (16 conv. and 3 FC layers); the added layers are shown in bold. The width of the conv. layers (the number of channels) is rather small, starting from 64 in the first layer and increasing by a factor of 2 after each max-pooling layer until it reaches 512.<br />
<br />
As shown in the table, multiple convolutional layers with small filters are stacked without any max-pooling layer between them. It is easy to show that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5, and a stack of three such layers a 7×7 field. Using a stack of two or three conv. layers in place of a single layer with a larger filter has two main advantages:<br />
1) Two or three non-linear rectification layers are incorporated instead of a single one, which makes the decision function more discriminative.<br />
2) The number of parameters is decreased.<br />
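The parameter saving in advantage (2) is easy to verify with a quick calculation. The sketch below is illustrative only (not from the paper's code); it compares a stack of three 3×3 layers against a single 7×7 layer with the same effective receptive field, assuming C input and output channels and ignoring biases:

```python
def conv_params(kernel, channels, layers):
    """Number of weights in `layers` stacked conv layers with square
    `kernel` and `channels` input/output channels (biases ignored)."""
    return layers * (kernel * kernel * channels * channels)

def receptive_field(kernel, layers):
    """Effective receptive field of `layers` stacked stride-1 conv layers."""
    rf = 1
    for _ in range(layers):
        rf += kernel - 1
    return rf

# Same 7x7 effective receptive field, fewer parameters:
C = 512
assert receptive_field(3, 3) == receptive_field(7, 1) == 7
print(conv_params(3, C, 3))  # 27 * C^2 = 7077888
print(conv_params(7, C, 1))  # 49 * C^2 = 12845056
```

For C = 512 channels, the stack of three 3×3 layers uses 27C² weights against 49C² for a single 7×7 layer, a saving of roughly 45%.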
<br />
Meanwhile, since a 1×1 convolution is essentially a linear projection onto a space of the same dimensionality, the incorporation of 1 × 1 conv. layers (configuration C), each followed by the rectification function, is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers.<br />
<br />
= Classification Framework =<br />
<br />
In this section, the details of classification ConvNet training and evaluation are described.<br />
<br />
===Training===<br />
<br />
Training is carried out by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum. Initial weights for some layers were obtained from configuration “A” which is shallow enough to be trained with random initialization. The intermediate layers in deep models were initialized randomly.<br />
In spite of the larger number of parameters and the greater depth of the introduced nets, these nets required relatively fewer epochs to converge due to the following reasons:<br />
(a) implicit regularization imposed by greater depth and smaller conv. filter sizes.<br />
(b) using pre-initialization of certain layers.<br />
<br />
With respect to (b) above, the shallowest configuration (A in the previous table) was trained using random initialization. For all the other configurations, the first four convolutional layers and the last three fully connected layers were initialized with the corresponding parameters from A, to avoid getting stuck during training due to a bad initialization. All other layers were randomly initialized by sampling from a normal distribution with zero mean. The authors also mention that they found it possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio<ref><br />
Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." International conference on artificial intelligence and statistics. 2010.<br />
</ref><br />
<br />
During training, the input to the ConvNets is a fixed-size 224 × 224 RGB image. To obtain this fixed-size input, each training image is rescaled and then randomly cropped (one crop per image per SGD iteration). Rescaling requires choosing a training scale from which the ConvNet input is cropped.<br />
Two approaches for setting the training scale S (the smallest side of an isotropically-rescaled training image) are considered:<br />
1) single-scale training, which requires a fixed S. <br />
2) multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax].<br />
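The two training-scale regimes can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation; the function names are ours, and the default range [256, 512] reflects the values used in the paper's multi-scale experiments:

```python
import random

def rescaled_size(width, height, S):
    """Isotropically rescale so the smaller image side equals S."""
    scale = S / min(width, height)
    return round(width * scale), round(height * scale)

def sample_train_scale(s_min=256, s_max=512, multi_scale=True):
    """Single-scale training fixes S; multi-scale ("scale jittering")
    samples S per image from [s_min, s_max]."""
    return random.randint(s_min, s_max) if multi_scale else s_min

def random_crop_origin(width, height, crop=224):
    """Top-left corner of a random crop of the rescaled image."""
    return random.randint(0, width - crop), random.randint(0, height - crop)

# One crop per image per SGD iteration:
S = sample_train_scale()
w, h = rescaled_size(400, 300, S)
x0, y0 = random_crop_origin(w, h)
```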
<br />
===Implementation===<br />
<br />
To improve the overall training speed of each model, the researchers parallelized the mini-batch gradient descent process. Since the model is very deep, training on a single GPU would take months to finish. To speed up the process, they trained separate batches of images on each GPU in parallel to compute the gradients: for example, with 4 GPUs the model takes 4 batches of images, computes their gradients separately, and then averages the four sets of gradients for the update. (Krizhevsky et al., 2012) introduced more sophisticated ways to parallelize the training of convolutional neural networks, but the researchers found that this simple configuration already sped up training by a factor of 3.75 on 4 GPUs (out of a possible maximum of 4), which worked well enough. <br />
Finally, it took 2–3 weeks to train a single net using four NVIDIA Titan Black GPUs.<br />
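The gradient-averaging scheme described above can be sketched in a few lines. This is a minimal illustration of synchronous data parallelism, not the authors' GPU implementation; the gradient values are made up:

```python
def average_gradients(worker_grads):
    """Synchronous data parallelism: each GPU computes gradients on its own
    mini-batch; the update uses the element-wise mean across workers."""
    n = len(worker_grads)
    return [sum(g[i] for g in worker_grads) / n
            for i in range(len(worker_grads[0]))]

# Hypothetical gradients from 4 GPUs for a 3-parameter model:
grads = [[1.0, 2.0, 3.0],
         [3.0, 2.0, 1.0],
         [2.0, 2.0, 2.0],
         [2.0, 2.0, 2.0]]
print(average_gradients(grads))  # [2.0, 2.0, 2.0]
```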
<br />
===Testing===<br />
<br />
At test time, the input image is classified as follows:<br />
First, it is isotropically rescaled to a pre-defined smallest image side, denoted Q. <br />
Then, the network is applied densely over the rescaled test image: the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).<br />
The resulting fully-convolutional net is then applied to the whole (uncropped) image.<br />
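The FC-to-conv conversion is a reshape of the existing weights. A shape-only sketch (an assumption-laden illustration: it assumes the standard VGG feature map of 512 channels at 7 × 7, 4096-unit FC layers, and 1000 classes; the function name is ours):

```python
def fc_as_conv_kernel(fc_out, in_channels, spatial):
    """An FC layer acting on a `spatial` x `spatial`, `in_channels`-deep
    feature map is equivalent to a conv layer with kernel shape
    (out_channels, in_channels, k, k): same weights, just reshaped."""
    return (fc_out, in_channels, spatial, spatial)

# VGG test-time conversion: the first FC layer becomes a 7x7 conv,
# the last two FC layers become 1x1 convs.
print(fc_as_conv_kernel(4096, 512, 7))   # (4096, 512, 7, 7)
print(fc_as_conv_kernel(4096, 4096, 1))  # (4096, 4096, 1, 1)
print(fc_as_conv_kernel(1000, 4096, 1))  # (1000, 4096, 1, 1)
```

Because every layer is now convolutional, the net can slide over an input larger than 224 × 224 and produce a class score map, which is then spatially averaged.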
<br />
= Classification Experiments =<br />
In this section, the image classification results on the ILSVRC-2012 dataset are described:<br />
<br />
== Single-Scale Evaluation ==<br />
<br />
In the first part of the experiment, the test image size was set to Q = S for fixed S, and Q = 0.5(Smin + Smax) for jittered S. One important result of this evaluation was that the classification error decreases with increased ConvNet depth.<br />
Moreover, the worse performance of the configuration with 1×1 filters (C) in comparison with the one with 3×3 filters (D) indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).<br />
Finally, scale jittering at training time leads to significantly better results than training on images with fixed smallest side. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.<br />
<br />
[[File:ConvNet1.PNG | center]]<br />
<br />
== Multi-Scale Evaluation ==<br />
<br />
In addition to single scale evaluation stated in the previous section, in this paper, the effect of scale jittering at test time is assessed by running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. The results indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale).<br />
<br />
Their best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error. On the test set, the configuration E achieves 7.3% top-5 error.<br />
<br />
[[File:ConvNet2.PNG | center]]<br />
<br />
== Comparison With The State Of The Art ==<br />
<br />
Their very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions.<br />
<br />
[[File:ConvNet3.PNG | center]]<br />
<br />
= Appendix A: Localization =<br />
<br />
In addition to classification, the introduced architectures have been used for localization. To perform object localisation, a very deep ConvNet is used in which the last fully connected layer predicts the bounding box location instead of the class scores. Apart from this last bounding-box prediction layer, architecture D, which was found to be the best-performing in the classification task, is used, and training of the localisation ConvNets is similar to that of the classification ConvNets. The main difference is that the logistic regression objective is replaced with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground truth.<br />
Two testing protocols are considered:<br />
The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class. (The bounding box is obtained by applying the network only to the central crop of the image.)<br />
The second, fully-fledged, testing procedure is based on the dense application of the localization ConvNet to the whole image, similarly to the classification task.<br />
<br />
The localization experiments indicate that the introduced very deep ConvNets achieve considerably better results than previous approaches while using a simpler localization method, owing to their more powerful representation.<br />
<br />
= Conclusion =<br />
<br />
Very deep ConvNets are introduced in this paper. The results show that the configuration has good performance on classification and localization and significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. Details and more results on these competitions can be found here.<ref><br />
Russakovsky, Olga, et al. [http://arxiv.org/pdf/1409.0575v3.pdf "Imagenet large scale visual recognition challenge."] International Journal of Computer Vision (2014): 1-42.<br />
</ref> They also showed that their configuration is applicable to some other datasets.<br />
<br />
= Resources =<br />
<br />
The Oxford Visual Geometry Group (VGG) has released code for their 16-layer and 19-layer models. The code is available on their [http://www.robots.ox.ac.uk/~vgg/research/very_deep/ website] in the format used by the [http://caffe.berkeleyvision.org/ Caffe] toolbox and includes the weights of the pretrained networks.<br />
<br />
=References=<br />
<references /><br />
<br />
Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=the_Wake-Sleep_Algorithm_for_Unsupervised_Neural_Networks&diff=27726the Wake-Sleep Algorithm for Unsupervised Neural Networks2017-08-30T13:46:30Z<p>Conversion script: Conversion script moved page The Wake-Sleep Algorithm for Unsupervised Neural Networks to the Wake-Sleep Algorithm for Unsupervised Neural Networks: Converting page titles to lowercase</p>
<hr />
<div>=Introduction=<br />
<br />
In considering general learning procedures, supervised methods for neural networks are limited in that they can only be executed in specifically-structured environments. For these systems to learn, the environment must be equipped with an external "teacher" providing the network with explicit feedback on its predictive performance. From there, the system needs a means of circulating this error information across the entire network so that the weights can be adjusted accordingly. An additional problem is that complicated models such as Sigmoid Belief Networks (SBNs) are difficult to learn because the posterior distribution is difficult to infer.<br />
<br />
The authors' idea is to assume that the posterior over hidden configurations at each hidden layer factorizes into a product of distributions for each separate hidden unit; thus they proposed the ''wake-sleep algorithm'', a two-phase procedure in which each network layer effectively learns representations of the activity in adjacent hidden layers. Here, the network is composed of feed-forward "recognition" connections used to generate an internal representation of the input, and feed-back generative connections used to produce an estimated reconstruction of the original input based on this learned internal representation. The goal is to learn an efficient representation which accurately characterizes the input to the system.<br />
<br />
==Why is the Posterior Distribution Intractable?==<br />
Let's say we are given a dataset <math>D = \{ x_1, \dots, x_{n} \}</math>. Then, according to Bayes' rule:<br />
<br />
<math><br />
P (\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)}<br />
</math><br />
<br />
Where:<br />
* <math>P(D | \theta)</math>: is the likelihood function of <math>\theta</math><br />
* <math>P(\theta)</math>: is the prior probability of <math>\theta</math><br />
* <math>P(\theta | D)</math>: is the posterior distribution over <math>\theta</math><br />
<br />
Things get tricky when one would like to calculate the predictive distribution:<br />
<br />
<math><br />
P (x | D) = \int P(x | \theta, D) P(\theta | D) d\theta<br />
</math><br />
<br />
It is the integral that is the problem: it cannot be solved by simple numerical approximation because of the high-dimensional space of <math>\theta</math>, so the predictive distribution is deemed intractable.<br />
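A toy illustration of why the integral is intractable: a quadrature grid over <math>\theta</math> needs a number of integrand evaluations that is exponential in the dimension, which is why one resorts to approximations such as Monte Carlo averaging over posterior samples. The code below is purely illustrative (the function names and numbers are ours, not from the paper):

```python
def grid_points(points_per_dim, dims):
    """A quadrature grid over a `dims`-dimensional theta needs
    points_per_dim ** dims evaluations of the integrand."""
    return points_per_dim ** dims

def monte_carlo_predictive(likelihood, posterior_samples):
    """P(x|D) estimated as the average of P(x|theta) over samples
    theta drawn (somehow) from the posterior P(theta|D)."""
    return sum(likelihood(t) for t in posterior_samples) / len(posterior_samples)

# Even a coarse 10-point-per-dimension grid explodes with dimension:
print(grid_points(10, 2))    # 100 evaluations
print(grid_points(10, 100))  # 10**100 evaluations: hopeless
```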
<br />
=Model Structure=<br />
The wake-sleep algorithm is used to train Helmholtz machines, networks possessing the feed-forward and feed-back connections as described above. In deploying a Helmholtz machine, we hope to encode an abstraction capturing the relevant properties of the data, and using this representation, find a generative reconstruction of the original input. This is analogous to learning a data-driven "bottom-up" representation of the input that is used to inform the higher-order "top-down" reconstruction. <br />
<br />
<center><br />
[[File:Network.png |frame | center |Figure 1: The Helmholtz network structure. ]]<br />
</center><br />
<br />
In this figure, <math>s_i</math>, <math>s_j</math>, and <math>s_k</math> refer to the binary activation values of units in each of the networks layers <math>I</math>, <math>J</math>, and <math>K</math>. The values <math>p_j</math> and <math>q_j</math> denote the probabilities that unit <math>s_j</math> will be active during generation and recognition, respectively. <math>p_j</math> and <math>q_j</math> are determined by the generative weights and recognition weights into the unit, along with the activities of connected units in the layers above and below. The probabilities are computed using the logistic function, as described below. <br />
<br />
==Selecting a Cost==<br />
<br />
To enforce the requirement that the network produces efficient reconstructions of the data, the cost function is selected by viewing the problem as a task of information transmission. In this task, the original input vector is to be indirectly communicated from the sender to the receiver via first sending the internal representation of the datum learned by the system, and then passing along the deviation of the original input from its approximation produced by the generative reconstruction of the internal representation. Naturally, the objective then becomes to minimize the length of the sequence of bits that is needed to express the original input in this indirect manner. This corresponds to adopting the minimum description length (MDL) principle to guide the process of learning the representation. MDL is a general methodological principle stating that among a set of candidate models for the data, the one which can be represented in the fewest number of bits in the process of communication ought to be selected (see http://papers.nips.cc/paper/798-autoencoders-minimum-description-length-and-helmholtz-free-energy.pdf). <br />
<br />
In order to see how the MDL criterion is to be implemented in this context, we must first specify a more precise network structure and the procedure which generates the reconstruction. The authors restrict the network to consist of stochastic binary units taking values in {0,1}, where the probability of unit ''v'' being active is given by<br />
<br />
::<math> P(s_v = 1) = (1 + \exp(-b_v - \sum_{u}^{} s_u w_{uv}))^{-1} </math> (1) <br/><br />
<br />
Here, <math> b_v </math> is the additive constant for unit ''v'', and <math> w_{uv} </math> is the weight associated with the connection to unit ''u''. For the bottom-up recognition connections, the units ''u'' which are summed over are from the immediately-preceding hidden layer, whereas these units will be from the immediately-following layer for the top-down generative connections.<br />
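Equation (1), and the per-unit description cost <math> C(s_j) </math> used below, can be sketched directly. This is an illustrative sketch (function names are ours); bits are measured with base-2 logarithms:

```python
import math
import random

def activation_prob(bias, states, weights):
    """Equation (1): logistic of the bias plus weighted input activity."""
    total = bias + sum(s * w for s, w in zip(states, weights))
    return 1.0 / (1.0 + math.exp(-total))

def sample_unit(bias, states, weights):
    """Stochastic binary unit: on with probability given by (1)."""
    return 1 if random.random() < activation_prob(bias, states, weights) else 0

def description_length(s, p):
    """C(s_j): bits needed to communicate state s under probability p."""
    return -(s * math.log2(p) + (1 - s) * math.log2(1 - p))

# A unit with zero bias and no inputs fires half the time,
# and communicating a fair binary unit costs exactly one bit:
print(activation_prob(0.0, [], []))  # 0.5
print(description_length(1, 0.5))    # 1.0
```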
<br />
Now, to produce a reconstruction of a particular input, we first feed the input into the network, using the recognition weights to activate the units of each successive layer until the highest hidden layer is reached. With multiple hidden layers, this process is analogous to creating a hierarchy of representations; the first hidden layer generates a representation of the original input, the second layer produces a representation of the representation provided to it by the first layer, and each additional layer similarly provides a higher-order meta-representation of the input. At the final layer, the network begins the process of using this set of representations to express the approximation via the generative connections. To do this, for each unit of the top layer, the sigmoid function is simply applied to the additive constant in (1), as there are no subsequent layers providing influence. Then, for each lower layer, the activation probabilities of its units are computed using (1), where the ''u'' 's are the set of units in the next-highest layer. <br />
<br />
The collection of states <math> s_j </math> of the binary units with corresponding activation probabilities <math> p_j </math> for the representation produced by this process now gives us a means to quantify the description length of the representation. We know that in order to represent the state of the stochastic binary unit ''j'', we require <math> C(s_j) := -s_j \log p_j - (1 - s_j)\log(1 - p_j) </math> bits. So, for an input ''d'', the total description length for representing ''d'' in terms of its learned representation <math> \alpha </math> is the sum of the description lengths for the individual hidden units of the network plus the description length of the procedure for decoding the original input given the hidden representation: <br />
<br />
::<math> C(\alpha , d) = \sum_{j}^{}C(s_j^{\alpha}) + \sum_{i}^{}C(s_i^d | d) </math> <br/><br />
<br />
Recalling that the hidden units are stochastic, it is clear that a given input vector will have a broad array of possible network representations, making the recognition weights specify a distribution <math> Q(\alpha | d) </math> over representations of the input. Taking this into account, we realize that the cost function for representing an input ''d'' will be a representation cost averaged over the possible representations of ''d''. If we also consider the entropy associated with the variability in representing ''d'', the overall cost for representing ''d'' becomes <math> C(d) = \sum_{\alpha}^{} Q(\alpha | d)C(\alpha , d) - (- \sum_{\alpha}^{} Q(\alpha | d)\log Q(\alpha | d)) </math>; namely, the expected description length for representing ''d'' minus the entropy in representation.<br />
<br />
==Minimizing the Cost==<br />
<br />
It turns out that the representation cost we specified is related to the notion of Helmholtz free energy in statistical physics. Analogous to this context, we find that the cost is minimized by setting the probability of a representation to be proportionate to an exponential function of its cost:<br />
<br />
::<math> P(\alpha | d) = \frac {\exp(-C(\alpha, d))}{\sum_{\beta}^{} \exp(-C(\beta, d))} </math><br />
<br />
This corresponds to the Boltzmann distribution with temperature parameter set to 1. Essentially, this means that rather than attempting to minimize the description cost of any single particular representation of an input, we should seek to find recognition weights which allows <math> Q( \alpha |d) </math> to be a good approximation of this Boltzmann distribution. Given this cost and general criterion for minimizing it, we can now turn to the two-phase procedure used to determine the network weights.<br />
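The minimizing distribution is just a softmax of negative description costs. A minimal sketch (the costs are made-up values):

```python
import math

def boltzmann(costs):
    """P(alpha|d) proportional to exp(-C(alpha, d)): the Boltzmann
    distribution over representations at temperature 1."""
    weights = [math.exp(-c) for c in costs]
    z = sum(weights)
    return [w / z for w in weights]

# Cheaper-to-describe representations get more probability mass:
probs = boltzmann([1.0, 2.0, 3.0])
print(probs[0] > probs[1] > probs[2])  # True
```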
<br />
=The Wake Phase=<br />
<br />
As suggested by the name, in this stage, the network is receptive to external input. The system will receive input vectors which will be converted into internal representations across the hidden layers. To allow the input to propagate forwards through the network, the bottom-up recognition weights must be held fixed. Consequently, after an input vector has been processed in a bottom-up manner, we naturally seek to minimize top-down reconstruction error, the error incurred in using a layer's learned representation to predict the configuration in the layer below it. This is accomplished by updating the generative weights using the derivative of the description cost <math> C( \alpha , d) </math> with respect to the generative weights, for the learned representation <math>\alpha</math>. Using the purely local delta rule, this term for the top-down weight on the connection from ''k'' to ''j'' becomes:<br />
:: <math> \Delta w_{kj} = \epsilon s_k^{\alpha}(s_j^{\alpha} - p_j^{\alpha}) </math>. <br />
<br />
We see that this term encourages each hidden layer to improve its performance in predicting the states of the layer immediately beneath it. Intuitively, the weight <math>w_{kj}</math> only changes if the unit <math>s_k</math> is on. If both units are observed to be on, but the probability <math>p_j^{\alpha}</math> is low, then the weight is incremented slightly to increase the probability that the unit in the lower layer is on given that the unit in the upper layer is. Alternatively, if only the upper unit is observed to be on, but the probability <math>p_j^{\alpha}</math> is high, then weight is decremented slightly to decrease the probability that the unit in the lower is on given that the unit in the upper layer is. As such, the weight updates ensure that the generative connections allow units in the upper layer to better reconstruct the values of the units in the lower layer. <br />
<br />
=The Sleep Phase=<br />
<br />
While the wake phase involved processing external input for the purpose of improving the top-down reconstructions of it, the sleep phase closes the network off to external input and seeks to improve the performance of the recognition weights using only the current internal generative model of the world that has been learned. We recall that adjustment of the recognition weights is a necessary step, as the wake phase only considers the particular learned representation of the data ''d'', whereas we seek to minimize the total cost <math> C(d) </math> across all possible representations of ''d''. It is this step which encourages the distribution over representations <math> Q(\alpha | d) </math> to greater resemble the Boltzmann distribution. The idea is that, given we have tuned a generative model of the input in the wake phase, we can simulate external input to the network by sampling from these generative models, and see how the recognition weights perform in representing these simulated cases. Beginning at the highest layer, we use its stochastic units to generate a "fantasy" input, propagating this fantasy back down through the network using the generative connections. From there, we update the bottom-up weights with the objective of maximizing the log probability of capturing the states of the hidden units which generated this fantasy input. This involves computing the weight-adjustment term <math> \Delta w_{jk} = \epsilon s_j^{\gamma} (s_k^{\gamma} - q_k^{\gamma}) </math>, where <math> \gamma </math> determines the states of the hidden and input units for a given fantasy, and <math> q_k^{\gamma} </math> is the probability of unit ''k'' being activated as a result of applying the recognition weights to generate a state <math> s_j^{\gamma} </math> of node ''j'' in the layer below.<br />
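Both local delta rules, the wake-phase update for generative weights and the sleep-phase update for recognition weights, can be written down in a few lines. An illustrative sketch with made-up values; <math> \epsilon </math> = 0.1 is an arbitrary learning rate:

```python
def wake_update(w_kj, s_k, s_j, p_j, eps=0.1):
    """Wake phase: adjust the generative weight from upper unit k to lower
    unit j with the local delta rule  eps * s_k * (s_j - p_j)."""
    return w_kj + eps * s_k * (s_j - p_j)

def sleep_update(w_jk, s_j, s_k, q_k, eps=0.1):
    """Sleep phase: adjust the recognition weight from lower unit j to
    upper unit k on a fantasy, using  eps * s_j * (s_k - q_k)."""
    return w_jk + eps * s_j * (s_k - q_k)

# Both units on but generative probability low: weight is incremented.
print(wake_update(0.0, 1, 1, 0.2))  # 0.08
# Upper unit off: no change, since the rule is purely local.
print(wake_update(0.0, 0, 1, 0.2))  # 0.0
```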
<br />
We can see that the sleep phase tunes the recognition weights by considering their performance on hypothetical input drawn from the network's generative model of the input. This is a proxy for the objective of obtaining recognition weights that are well-calibrated to the true external input being considered. An obvious limitation in this approximation is that, in the early stages of training, the network's generative model will likely be a poor representation of the true distribution of the input, meaning that the recognition weights will be adjusted based on data which is largely uncharacteristic of the training set.<br />
<br />
=Model Limitations=<br />
<br />
* Due to the network structure, the distribution over representations <math> Q(\alpha | d) </math> is constrained to be a factorial distribution, as the states of units within the same hidden layer are independent when conditioned on the nodes of the layer below. This distributional form is advantageous in the sense that for a layer consisting of n hidden units, the probability distribution for the <math> 2^n </math> hidden states can be determined by providing n values, as opposed to <math> 2^n - 1 </math>. <br />
<br />
* On the other hand, this conditional independence property also prohibits the model from representing "explaining-away" effects, a type of relation in which the states of one layer can be efficiently represented by only activating exactly one of two units in the following layer. <br />
<br />
* However, the authors argue that the limitation of <math> Q(\alpha | d) </math> to be a factorial distribution is not completely debilitating, as, in the wake phase, the generative weights are adjusted to make the generative model similar to <math> Q(\alpha | d) </math>. In this sense, the generative model is encouraged to approximate the factorial form <math> Q(\alpha | d) </math>, which limits issues surrounding the discrepancies between these distributions.<br />
<br />
* The recognition weights only approximately follow the gradient of the variational bound on the log probability of the data.<br />
<br />
=Empirical Performance=<br />
<br />
To examine the strength of the approximations adopted in their approach, the wake-sleep procedure was implemented in a task of hand-written digit-recognition. In addition to achieving accurate compressions of the data, Figure 2 below displays the interesting result illustrating that, after the completion of learning, the internal generative model has been tuned to produce accurate fantasy digits.<br />
<br />
<center><br />
[[File:Digits.png |frame | center |Figure 2: Input digits (left) as compared to simulated fantasy digits (right). ]]<br />
</center><br />
<br />
=Discussion=<br />
<br />
* Bornschein and Bengio (2014) published a recent paper suggesting that other models, such as the Neural Autoregressive Distribution Estimator (NADE), are better approximators to the posterior distribution of latent variables.<ref>Bornschein, Jörg, and Yoshua Bengio. "Reweighted wake-sleep." arXiv preprint arXiv:1406.2751 (2014).</ref><br />
<br />
=References=<br />
<br />
1. Hinton GE, Dayan P, Frey BJ, Neal RM. The Wake-Sleep Algorithm for Unsupervised Neural Networks. Science 26, May 1995. DOI: 10.1126/science.7761831<br />
<br />
2. Hinton GE, Zemel RS. Autoencoders, Minimum Description Length, and Helmholtz Free Energy. Advances in Neural Information Processing Systems 6 (NIPS 1993).</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Generative_Stochastic_Networks_Trainable_by_Backprop&diff=27728deep Generative Stochastic Networks Trainable by Backprop2017-08-30T13:46:30Z<p>Conversion script: Conversion script moved page Deep Generative Stochastic Networks Trainable by Backprop to deep Generative Stochastic Networks Trainable by Backprop: Converting page titles to lowercase</p>
<hr />
<div>= Introduction =<br />
<br />
[[File:figure_1_bengio.png |thumb|upright=1.75| Figure 1 Top: <ref>Bengio, Yoshua, Mesnil, Grégoire, Dauphin, Yann, and<br />
Rifai, Salah. Better mixing via deep representations. In<br />
ICML’13, 2013b. </ref> A denoising auto-encoder defines an estimated Markov chain where the transition operator first samples a corrupted <math>\bar{X}</math> from <math>C(\bar{X}|X)</math> and then samples a reconstruction<br />
from <math>P_\theta(X|\bar{X})</math>, which is trained to estimate the ground truth <math>P(X|\bar{X})</math>. Note how, for any given <math>\bar{X}</math>, <math>P(X|\bar{X})</math> is a much simpler (roughly unimodal) distribution than the ground truth <math>P(X)</math>, and its partition function is thus easier to approximate.<br />
Bottom: More generally, a GSN allows the use of arbitrary latent<br />
variables H in addition to X, with the Markov chain state (and<br />
mixing) involving both X and H. Here H is the angle about<br />
the origin. The GSN inherits the benefit of a simpler conditional<br />
and adds latent variables, which allow far more powerful deep<br />
representations in which mixing is easier]]<br />
<br />
The Deep Learning boom that has been seen in recent years was spurred initially by research in unsupervised learning techniques.<ref><br />
Bengio, Yoshua. Learning deep architectures for AI. Now<br />
Publishers, 2009.</ref> However, most of the major successes of the last few years have been based on supervised techniques. A drawback of the unsupervised methods is that their models involve intractable sums and expensive computations (for inference, learning, sampling, and partition functions). The paper puts forth an idea for a network that models a conditional distribution <math>P(X|\bar{X})</math>, which can be seen as a local (usually unimodal) representation of <math>P(X)</math>, where <math>\bar{X}</math> is a corrupted version of the original data <math>{X}</math>. The Generative Stochastic Network (GSN) combines arbitrary latent variables <math>H</math> with <math>X</math> to form the state of a Markov chain whose layers eventually recreate a representation of the original data. Training of the network does not need Gibbs sampling or large partition functions; it is trained with backpropagation and all the tools that come with it. <br />
<br />
In DBM <ref> Salakhutdinov, Ruslan and Hinton, Geoffrey E. Deep<br />
Boltzmann machines. In AISTATS’2009, pp. 448–455,<br />
2009 </ref>, sampling from <math>P(x, h)</math> relies on inference and sampling (the contrastive divergence algorithm). To obtain a gradient, there are intractable sums that must be calculated, although there are ways around this. The problem with these methods is that they make strong assumptions: in essence, the sampling procedures are biased towards certain distribution types (i.e. those with a small number of modes). GSNs attempt to get around this. <br />
<br />
The reasoning for wanting to have a tractable generative model that uses unsupervised training is that within the realm of data, there is a far greater amount of unlabelled data than labelled data. Future models should be able to take advantage of this information.<br />
<br />
= Generative Stochastic Network (GSN) = <br />
[[File:figure_2_bengio.png |thumb|left|upright=2| Figure 2 Left: Generic GSN Markov chain with state variables <math>X_t</math> and <math>H_t</math>. Right: GSN Markov chain inspired by the unfolded computational graph of the Deep Boltzmann Machine Gibbs sampling process, but with backprop-able stochastic units at each layer. The training example <math>X = x_0</math> starts the chain. Either odd or even layers are stochastically updated at each step. All <math>x_t</math>'s are corrupted by salt-and-pepper noise before entering the graph (lightning symbol). Each <math>x_t</math> for <math>t > 0</math> is obtained by sampling from the reconstruction distribution for that step, <math>P_{\theta_2}(X_t|H_t)</math>. The walkback training objective is the sum over all steps of log-likelihoods of target <math>X = x_0</math> under the reconstruction distribution. In the special case of a unimodal Gaussian reconstruction distribution, maximizing the likelihood is equivalent to minimizing reconstruction error; in general one trains to maximum likelihood, not simply minimum reconstruction error]]
<br />
The paper describes the Generative Stochastic Network as a generalization of generative denoising autoencoders, in that the estimates of the data distribution are based on noisy sampling. Instead of directly estimating the data distribution, the model parametrizes the transition operator of a Markov chain; this change turns the problem into one that resembles supervised training. GSN relies on estimating the transition operator, that is <math>P(x_t | x_{t-1})</math> or <math>P(x_t, h_t|x_{t-1}, h_{t-1})</math>, which contains only a small number of important modes, leading to a simple gradient of the partition function. By parametrizing the transition operator rather than <math>P(X)</math> directly, GSNs leverage the strength of function approximation and allow unsupervised methods to be trained by gradient descent and maximum likelihood, with no partition functions, just back-propagation.<br />
<br />
The estimation of <math>P(X)</math> proceeds as follows: create <math>\bar{X}</math> from the corruption distribution <math>C(\bar{X}|X)</math>, where <math>C</math> adds some type of noise to the original data. The model is then trained to reconstruct <math>X</math> from <math>\bar{X}</math>, thus obtaining <math>P(X|\bar{X})</math>. This is easier to model than the whole of <math>P(X)</math>, since <math>P(X|\bar{X})</math> is dominated by fewer modes than <math>P(X)</math>. Bayes' rule then dictates that <math>P(X|\bar{X}) = \frac{1}{z}C(\bar{X}|X)P(X)</math>, where <math>z</math> is a normalizing constant independent of <math>X</math>. This makes it possible to recover <math>P(X)</math> from the other two distributions. <br />
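As a toy sketch of this corrupt-then-reconstruct setup, the following corrupts a binary vector with salt-and-pepper noise; the <code>corrupt</code> function and the data are illustrative, not the paper's code.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, p=0.3):
    """Salt-and-pepper corruption C(X_bar | X): each entry is
    replaced by a random 0/1 value with probability p."""
    mask = rng.random(x.shape) < p
    noise = rng.integers(0, 2, size=x.shape)
    return np.where(mask, noise, x)

x = np.array([1, 0, 1, 1, 0, 0, 1, 0])
x_bar = corrupt(x)
# A model would then be trained (e.g. by gradient descent on the
# reconstruction log-likelihood) to map x_bar back toward x, i.e. to
# approximate P(X | X_bar), which has far fewer modes than P(X).
```
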
<br />
Using a parametrized model (i.e. a neural network), it was found that the approximation <math>P_{\theta}(X|\bar{X})</math> can be used to estimate <math>P_{\theta}(X)</math>: the stationary distribution <math>\pi(X)</math> of the resulting Markov chain converges to <math>P(X)</math>. Figure 2 shows this process. <br />
<br />
One may wonder where the complexity of the original data distribution went. If <math>P_{\theta}(X|\bar{X})</math> and <math>C(\bar{X}|X)</math> are simple, how can they model the complex distribution <math>P(X)</math>? The authors explain that even though <math>P_{\theta}(X|\bar{X})</math> has few modes, the location of those modes depends on <math>\bar{X}</math>. Because the estimation is based on many values of <math>\bar{X}</math>, learning the mapping from <math>\bar{X}</math> to mode locations becomes a supervised function approximation problem, which is comparatively easy.<br />
<br />
Training the GSN involves moving along a Markov chain, using the transition distribution between nodes to update the weights of the GSN. The transition distribution <math>f(h,h', x)</math> is trained to maximize the reconstruction likelihood. The following picture demonstrates the Markov chain that allows for the training of the model. Note the similarities to Hinton's contrastive divergence.<br />
<br />
[[File:bengio_markov.png |centre|]]<br />
<br />
<br />
<br />
= Experimental Results =<br />
Some initial experimental results were generated without extensive hyperparameter tuning. This was done to maintain consistency across the tests, and likely to show that even without such optimization the results approach the performance of more established unsupervised learning networks. The main comparisons were made to Deep Boltzmann Machines (DBM) and Deep Belief Networks (DBN). <br />
<br />
=== MNIST ===<br />
<br />
The non-linearity for the units in the GSN was <math display="block"> h_i = \eta_{out} + \tanh (\eta_{in} + a_i) </math> with <math>a_i</math> the linear activation for unit <math>i</math>, and <math>\eta_{in}</math> and <math>\eta_{out}</math> both zero-mean Gaussian noise. Sampling of incomplete data can be done in a similar manner to a DBM, where representations propagate both upwards and downwards in the network. This allows for pattern completion similar to that achieved by a DBM. The third image in Figure 3 demonstrates the GSN's ability to start from only half an image (where the rest is noise) and complete the digit, showing it has an internal representation of the digit that can be sampled to complete it. <br />
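A direct sketch of this stochastic activation in NumPy; the noise standard deviations are illustrative choices, not values from the paper.<br />

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_tanh(a, sigma_in=0.1, sigma_out=0.1):
    """GSN-style stochastic unit: zero-mean Gaussian noise injected
    both before and after the tanh nonlinearity:
    h = eta_out + tanh(eta_in + a)."""
    eta_in = rng.normal(0.0, sigma_in, size=a.shape)
    eta_out = rng.normal(0.0, sigma_out, size=a.shape)
    return eta_out + np.tanh(a + eta_in)

a = np.linspace(-2.0, 2.0, 5)
h = noisy_tanh(a)
# With zero noise the unit reduces to an ordinary tanh activation.
```
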
<br />
<br />
[[File:figure_3_bengio.png |thumb|centre|upright=2| Figure 3 Top: two runs of consecutive samples (one row after the<br />
other) generated from 2-layer GSN model, showing fast mixing<br />
between classes, nice and sharp images. Note: only every fourth<br />
sample is shown; see the supplemental material for the samples<br />
in between. Bottom: conditional Markov chain, with the right<br />
half of the image clamped to one of the MNIST digit images and<br />
the left half successively resampled, illustrating the power of the<br />
generative model to stochastically fill-in missing inputs.]]<br />
<br />
=== Faces ===<br />
<br />
The following figure shows the GSN's ability to perform facial reconstruction. <br />
<br />
[[File:figure_4_bengio.png |thumb | upright=2|centre | Figure 4 GSN samples from a 3-layer model trained on the TFD<br />
dataset. Every second sample is shown; see supplemental material<br />
for every sample. At the end of each row, we show the nearest<br />
example from the training set to the last sample on that row, to illustrate<br />
that the distribution is not merely copying the training set.]]<br />
<br />
<br />
=== Comparison ===<br />
Test set log-likelihood lower bound (LL) obtained by a Parzen density estimator constructed using 10000 generated samples, for different generative models trained on MNIST. The LL is not directly comparable to AIS likelihood estimates because a Gaussian mixture rather than a Bernoulli mixture is used to compute the likelihood. A DBN-2 has 2 hidden layers, a contractive autoencoder (CAE-1) has 1 hidden layer, and a CAE-2 has 2. The denoising autoencoder (DAE) is essentially a GSN-1, with no injection of noise inside the network.<br />
<br />
[[File:GSN_comparison.png]]<br />
<ref>Rifai, Salah, Bengio, Yoshua, Dauphin, Yann, and Vincent,<br />
Pascal. A generative process for sampling contractive<br />
auto-encoders. In ICML’12, 2012</ref><br />
<br />
= Conclusions and Critique =<br />
The main objective of the paper and technique was to avoid the intractable aspects of traditional generative models. This was achieved by training a model to reconstruct noisy data, which creates a local and simple approximation of the whole data distribution. Repeating this process, treated as a Markov chain with each transition distribution corresponding to a new representation of the data distribution, yields a model that can be trained with standard supervised neural network tools. Experiments show similar results between the GSN and the DBM; however, the GSN needs no layer-wise pre-training. <br />
<br />
One critique of this paper is that the authors repeatedly point out that their method should, in theory, be faster than traditional models, yet they provide no information on the time each network took to train. This could be addressed by training networks with approximately the same number of parameters on a specific task and timing and evaluating them on that basis. <br />
The paper also does not do a very good job of describing how the training relates to the Markov chain. The relationship can be teased out eventually, but it is not immediately apparent and could have been elaborated upon further. <br />
One section briefly glosses over Sum-Product Networks (SPN) as an alternative tractable graphical model. Since SPNs address the same problem the authors propose to solve, it would have made sense to compare their model against SPNs as well, but this was not done.<br />
<br />
= References =<br />
<references></div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Learning_of_the_tissue-regulated_splicing_code&diff=27714deep Learning of the tissue-regulated splicing code2017-08-30T13:46:29Z<p>Conversion script: Conversion script moved page Deep Learning of the tissue-regulated splicing code to deep Learning of the tissue-regulated splicing code: Converting page titles to lowercase</p>
<hr />
<div>= Introduction =<br />
<br />
Alternative splicing (AS) is a regulated process during gene expression that enables the same gene to give rise to splicing isoforms containing different combinations of exons, which leads to different protein products. Furthermore, AS is often tissue-dependent. This paper mainly focuses on applying a Deep Neural Network (DNN) to predicting the outcome of splicing, and compares its performance to previously trained models: a Bayesian Neural Network<ref>https://www.cs.cmu.edu/afs/cs/academic/class/15782-f06/slides/bayesian.pdf</ref> (BNN) and Multinomial Logistic Regression<ref>https://en.wikipedia.org/wiki/Multinomial_logistic_regression</ref> (MLR). <br />
<br />
A major difference the authors introduced in the DNN is that each tissue type is treated as an input, whereas in the previous BNN each tissue type was considered a different output of the neural network. Moreover, in previous work the splicing code inferred only the direction of change of the percentage of transcripts with an exon spliced in (PSI). This paper instead predicts absolute PSI for each tissue individually, without averaging across tissues, and also predicts the difference in PSI (<math>\Delta</math>PSI) between pairs of tissues. Unlike a standard deep neural network, this model trains these two prediction tasks simultaneously.<br />
<br />
= Model =<br />
<br />
The dataset consists of 11019 mouse alternative exons profiled from RNA-Seq<ref>https://en.wikipedia.org/wiki/RNA-Seq</ref> Data. Five tissue types are available, including brain, heart, kidney, liver and testis. <br />
<br />
The DNN is fully connected, with multiple layers of non-linearity consisting of hidden units. The mathematical expression of model is below:<br />
<br />
::::::: <math>a_v^l = f\left(\sum_{m=1}^{M^{l-1}}{\theta_{v,m}^{l}a_m^{l-1}}\right)</math> <br />
:::::::where <math>a_v^l</math> is the activation of unit <math>v</math> in layer <math>l</math>, computed from the weighted sum of outputs of the previous layer, and <math>\theta_{v,m}^{l}</math> are the weights between layers. <br />
<br />
::::::: <math>f_{ReLU}(z)=\max(0,z)</math><br />
::::::: ReLU units were used for all hidden layers except the first hidden layer, which uses tanh units.<br />
<br />
::::::: <math>h_k=\frac{exp(\sum_m{\theta_{k,m}^{last}a_m^{last}})}{\sum_{k'}{exp(\sum_{m}{\theta_{k',m}^{last}a_m^{last}})}}</math><br />
::::::: this is the softmax function of the last layer. <br />
<br />
The cost function minimized during training is <math>E=-\sum_n\sum_{k=1}^{C}{y_{n,k}\log(h_{n,k})}</math>, where <math>n</math> indexes the training examples and <math>k</math> indexes the <math>C</math> classes. <br />
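A minimal numerical sketch of the softmax output and this cross-entropy cost; the logits and targets are toy values, not the paper's data.<br />

```python
import numpy as np

def softmax(z):
    """Softmax output of the last layer, h_k."""
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y, h):
    """E = -sum_n sum_k y_{n,k} log h_{n,k}."""
    return -np.sum(y * np.log(h))

# Two examples, C = 3 classes (e.g. the low/medium/high LMH code).
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
h = softmax(logits)
y = np.array([[1.0, 0.0, 0.0],   # one-hot targets
              [0.0, 1.0, 0.0]])
loss = cross_entropy(y, h)
```
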
<br />
The identities of the two tissues are then appended to the output vector of the first hidden layer, together forming the input to the second hidden layer. Each tissue identity is a 1-of-5 binary vector in this case (demonstrated in Fig. 1). The first training target has three classes, labelled ''low'', ''medium'', ''high'' (the LMH code). The second task describes the <math>\Delta PSI</math> between two tissues for a particular exon; its three classes are ''decreased inclusion'', ''no change'' and ''increased inclusion'' (the DNI code). Both the LMH and DNI codes are trained jointly, reusing the same hidden representations learned by the model. The DNN was trained by backpropagation with dropout, using different learning rates for the two tasks. <br />
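For illustration, constructing the second-layer input from a hypothetical first-layer hidden code plus the two tissue one-hot vectors might look like this (the hidden values are made up; only the 1-of-5 encoding scheme comes from the paper).<br />

```python
import numpy as np

TISSUES = ["brain", "heart", "kidney", "liver", "testis"]

def one_hot(tissue):
    """1-of-5 binary encoding of a tissue identity."""
    v = np.zeros(len(TISSUES))
    v[TISSUES.index(tissue)] = 1.0
    return v

# Hypothetical first-hidden-layer output for one exon's genomic features.
h1 = np.array([0.2, -0.7, 1.1])

# Input to the second hidden layer: hidden code plus the identities of
# the two tissues being compared (10 extra binary inputs in total).
x2 = np.concatenate([h1, one_hot("heart"), one_hot("liver")])
```
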
<br />
[[File: Modell.png]]<br />
<br />
= Training the model =<br />
<br />
The first hidden layer was trained as an autoencoder to reduce the dimensionality of the features in an unsupervised manner. This method of pretraining has been used in deep architectures to initialize learning near a good local minimum. In the second stage of training, the weights from the input layer to the first hidden layer are fixed, and 10 additional inputs corresponding to the tissues are appended. The vector representation for a tissue is binary; for example, it takes the form [0 1 0 0 0] to denote the second of five possible tissue types. The weights connecting the remaining hidden layers of the DNN are then trained together in a supervised manner with back-propagation. <br />
<br />
The DNN weights were initialized with small random values sampled from a standard Gaussian distribution. Learning was performed with stochastic gradient descent with momentum and dropout, where mini-batches were constructed. A small L1 weight penalty was included in the cost function. The model’s weights were updated after each mini-batch. The learning rate was decreased with epochs <math>\epsilon</math>, and also included a momentum term <math>\mu</math> that starts out at 0.5, increasing to 0.99, and then stays fixed. The weights of the model parameters <math>\theta</math> were updated as follows:<br /><br />
<br />
::: <math> \, \theta_e = \theta_{e-1} + \Delta \theta_e </math><br />
<br />
::: <math> \Delta\theta_e = \mu_e\Delta\theta_{e-1} - (1-\mu_e)\epsilon_e\nabla E(\theta_e) </math><br />
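A minimal sketch of this momentum update on a toy quadratic objective; <math>\mu</math> and <math>\epsilon</math> are held fixed here for simplicity, whereas the paper varies them across epochs.<br />

```python
import numpy as np

def momentum_step(theta, delta_prev, grad, mu, eps):
    """theta_e = theta_{e-1} + delta_e,
    delta_e = mu * delta_{e-1} - (1 - mu) * eps * grad."""
    delta = mu * delta_prev - (1.0 - mu) * eps * grad
    return theta + delta, delta

# Toy objective E(theta) = 0.5 * theta^2, so grad E = theta.
theta = np.array([1.0])
delta = np.zeros_like(theta)
mu, eps = 0.5, 0.1
for _ in range(50):
    theta, delta = momentum_step(theta, delta, theta, mu, eps)
# theta should have moved close to the minimum at 0.
```
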
<br />
In addition, the data were filtered before training by excluding examples whose total number of RNA-Seq junction reads was below 10. This removed 45.8% of the total number of training examples. <br />
<br />
Both the LMH and DNI codes are trained together, because the two tasks might learn at different rates; this prevents one task from overfitting too soon and negatively affecting the performance of the other before the complete model is fully trained. <br />
<br />
The targets consist of (i) PSI for each of the two tissues and (ii) <math> \Delta PSI </math> between the two tissues. As a result, given identical tissues, the model should predict no change for <math> \Delta PSI </math>. Also, if the tissues are swapped in the input, a previous ''increased inclusion'' label should become ''decreased inclusion''. The training examples are constructed with some redundancy (i.e., in some of the training examples the two tissues are identical) so the model learns this without it having to be explicitly specified.<br />
<br />
The batches for training were biased such that earlier batches contained 4/5 samples with high tissue variability and 1/5 with low tissue variability. After the high-variability examples are all used, the batches randomly select from the remaining lower-variability examples. The stated purpose is to give examples with high tissue variability greater importance, while avoiding over-fitting by presenting them early in training.<br />
<br />
= Performance comparison =<br />
<br />
The performance of the model was assessed using the area under the Receiver-Operating Characteristic curve (AUC) metric. The paper compares the three methods - DNN, BNN and MLR - against the same baseline. <br />
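As a sketch, AUC can be computed directly from scores via the rank-sum (Mann-Whitney) formulation; this is a generic illustration of the metric, not the authors' evaluation code.<br />

```python
import numpy as np

def auc(labels, scores):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs ranked correctly, counting ties as half."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

# Perfect separation gives AUC = 1.0; chance-level scoring gives 0.5.
print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # → 1.0
```
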
<br />
The results for the LMH code are shown in the table below. Table 1a reports AUC for PSI predictions on all tissues, while Table 1b reports AUC evaluated on the subset of events exhibiting large tissue variability. From Table 1a, the performance of the DNN in the ''low'' and ''high'' categories is comparable with the BNN, but the DNN outperforms at the ''medium'' level. From Table 1b, the DNN significantly outperformed the BNN and MLR. In both comparisons, MLR performed poorly. <br />
<br />
[[File: LMH.png]]<br />
<br />
Next, consider how well the different methods predict <math>\Delta PSI</math> (the DNI code). The DNN predicts the LMH and DNI codes at the same time, while the BNN can only predict the LMH code. Thus, for a fair comparison, the authors trained an MLR on the predicted outputs for each tissue pair from the BNN, and similarly trained an MLR on the LMH outputs of the DNN. Table 2 shows that both the DNN and DNN+MLR outperformed the BNN+MLR and the plain MLR. <br />
<br />
[[File: DNI.png]]<br />
<br />
<br />
'''Why did DNN outperform?'''<br />
<br />
1. The use of tissue types as an input feature, which stringently required that the model's hidden representations be in a form that can be well modulated by information specifying the different tissue types for splicing-pattern prediction. <br />
<br />
2. The model is described by thousands of hidden units and multiple layers of non-linearity. In contrast, the BNN has only 30 hidden units, which may not be sufficient. <br />
<br />
3. A hyperparameter search is performed to optimize the DNN.<br />
<br />
4. The use of dropout, which contributed ~1-6% improvement in the LMH code for different tissues, and ~2-7% in the DNI code, compared with without dropout.<br />
<br />
5. Training was biased toward the tissue-specific events (by construction of minibatches).<br />
<br />
= Conclusion =<br />
<br />
This work shows that DNNs can be applied even to sparse biological datasets. Furthermore, the input features can be analyzed in terms of the model's predictions to gain insight into the inferred tissue-regulated splicing code. The architecture can easily be extended to incorporate more data from different sources.<br />
<br />
= References =<br />
<br />
<references /></div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=27716deep Convolutional Neural Networks For LVCSR2017-08-30T13:46:29Z<p>Conversion script: Conversion script moved page Deep Convolutional Neural Networks For LVCSR to deep Convolutional Neural Networks For LVCSR: Converting page titles to lowercase</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They have outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal/spatial variations while reducing translation variance. CNNs are attractive for speech recognition for two reasons: first, they are translation invariant, which makes them an alternative to various speaker adaptation techniques; second, the spectral representation of speech has strong local correlations, which CNNs naturally capture.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, for which the weights are shared across the input space, followed by a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to study the behaviour of CNNs on speech tasks. Results are reported on the EARS dev04f dataset. 40-dimensional log mel-filter-bank coefficients are used as features. The size of each hidden fully connected layer is 1024, and the softmax layer size is 512. During fine-tuning, the learning rate is halved after each iteration for which the objective function does not improve sufficiently on a held-out validation set. After halving the learning rate 5 times, training stops. <br />
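This learning-rate annealing recipe can be sketched as follows; the <code>improvements</code> sequence and the threshold are hypothetical stand-ins for the per-iteration gains on the held-out objective.<br />

```python
def lr_schedule(improvements, lr=0.1, max_halvings=5, threshold=0.01):
    """Sketch of the annealing schedule: halve the learning rate whenever
    the held-out objective fails to improve sufficiently; stop once the
    rate has been halved `max_halvings` times."""
    halvings = 0
    schedule = []
    for gain in improvements:
        schedule.append(lr)
        if gain < threshold:      # insufficient improvement
            lr /= 2.0
            halvings += 1
            if halvings >= max_halvings:
                break             # stop training
    return schedule

# Improvements shrink over iterations, triggering repeated halvings.
sched = lr_schedule([0.05, 0.05, 0.005, 0.004, 0.002, 0.001, 0.0005, 0.0])
```
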
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. The convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, unlike what had been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used followed by fully connected layers, similar to the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequencies have different characteristics, hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed weight sharing across nearby frequencies only. Although this addresses the problem, it limits adding multiple convolutional layers. In this work, weight sharing is done across the entire feature space while using more filters - compared to vision - to capture the differences between the low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the 2-convolutional, 4-fully-connected configuration. The total number of network parameters is kept constant for fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
A slight improvement can be obtained by using 128 hidden units for the first convolutional layer and 256 for the second, as many hidden units are needed in the convolutional layers to capture the locality differences between frequency regions in speech.<br />
<br />
== Optimal Feature Set ==<br />
Note that Linear Discriminant Analysis (LDA) features cannot be used with CNNs because LDA removes local correlation in frequency. Mel filter-bank (FB) features, which exhibit this locality property, are used instead.<br />
<br />
The following features are used to build the table below; WER is used to decide the best feature set.<br />
# Vocal Tract Length Normalization (VTLN)-warping to help map features into a canonical space.<br />
# feature space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only in the frequency domain, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is tested on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
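A minimal sketch of frequency-only max pooling: the maximum is taken over groups of bands along the frequency axis while the time axis is left intact. The array shape and pooling of non-overlapping groups are illustrative assumptions.<br />

```python
import numpy as np

def max_pool_freq(x, pool=3):
    """Max pooling along the frequency axis only; x has shape (freq, time).
    Any remainder bands that do not fill a full group are dropped."""
    f, t = x.shape
    f = f - f % pool
    return x[:f].reshape(f // pool, pool, t).max(axis=1)

x = np.arange(24, dtype=float).reshape(6, 4)  # 6 frequency bands, 4 frames
y = max_pool_freq(x, pool=3)                  # → shape (2, 4)
```
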
<br />
= Results with Proposed Architecture =<br />
<br />
The optimal architecture described in the previous section is used in the experiments. A 50-hour English Broadcast News (BN) dataset is used for training, and the EARS dev04f and rt04 datasets are used for testing. Five systems are compared, as shown in the following table. "Hybrid" means the DNN or CNN produces the likelihood probabilities for the HMM, while "DNN/CNN-based features" means the DNN or CNN produces features that are then used by a GMM/HMM system. The hybrid CNN offers a 15% relative improvement over the GMM/HMM system and a 3-5% relative improvement over the hybrid DNN. CNN-based features offer a 5-6% relative improvement over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER for NN Hybrid and Feature-Based Systems.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 18.8<br />
| 18.1<br />
|-<br />
| Hybrid DNN<br />
| 16.3<br />
| 15.8<br />
|-<br />
| DNN-based features<br />
| 16.7<br />
| 16.0<br />
|-<br />
| Hybrid CNN<br />
| 15.8<br />
| 15.0<br />
|-<br />
| CNN-based features<br />
| 15.2<br />
| 15.0<br />
|-<br />
|}<br />
<br />
= Results on Large Tasks =<br />
<br />
After tuning the CNN configuration on a small dataset, the CNN-based features system is tested on two larger datasets.<br />
<br />
== Broadcast News ==<br />
Broadcast News consists of 400 hours of speech data and was used for training. The DARPA EARS rt04 and dev04f datasets were used for testing. The following table shows that CNN-based features offer a 13-18% relative improvement over the GMM/HMM system and 10-12% over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER on Broadcast News, 400 hrs.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 16.0<br />
| 13.8<br />
|-<br />
| Hybrid DNN<br />
| 15.1<br />
| 13.4<br />
|-<br />
| DNN-based features<br />
| 14.9<br />
| 13.4<br />
|-<br />
| CNN-based features<br />
| 13.1<br />
| 12.0<br />
|-<br />
|}<br />
<br />
== Switchboard ==<br />
<br />
The Switchboard dataset is 300 hours of conversational American English telephony data. The Hub5'00 set is used for validation, while the rt03 set is used for testing. Switchboard (SWB) and Fisher (FSH) are portions of the test set, and results are reported separately for each. Three systems, shown in the following table, are compared. CNN-based features offer a 13-33% relative improvement over the GMM/HMM system and a 4-7% relative improvement over the hybrid DNN. These results show that CNNs are superior to both GMMs and DNNs.<br />
{| class="wikitable"<br />
|+ WER on Switchboard, 300 hrs.<br />
! Model<br />
! Hub5’00 SWB<br />
! rt03 FSH<br />
! rt03 SWB<br />
|-<br />
| Baseline GMM/HMM <br />
| 14.5<br />
| 17.0<br />
| 25.2<br />
|-<br />
| Hybrid DNN<br />
| 12.2<br />
| 14.9<br />
| 23.5<br />
|-<br />
| CNN-based features<br />
| 11.5<br />
| 14.3<br />
| 21.9<br />
|-<br />
|}<br />
<br />
= Conclusions and Discussions =<br />
<br />
This paper demonstrates that CNNs perform well for LVCSR and shows that multiple convolutional layers give further improvement when the convolutional layers have a large number of feature maps. CNNs were shown to be superior to both GMMs and DNNs on a small speech recognition task. CNNs were also used to produce features for GMMs; this system was tested on larger datasets and outperformed both the GMM- and DNN-based systems. The Mel filter-bank is regarded as a suitable feature for the CNN since it exhibits the required locality property. CNNs are able to capture translational invariance across different speakers by replicating weights in the time and frequency domains, and they can model local correlations in speech.<br />
<br />
In this paper, the authors conclude that 2 convolutional and 4 fully connected layers is optimal for CNNs. However, the earlier table shows that the result for 2 convolutional and 4 fully connected layers is close to that for 3 convolutional and 3 fully connected layers; more experiments may be needed to support this conclusion statistically.<br />
<br />
<br />
The authors set up the experiments without clarifying the following:<br />
# The hybrid CNN was not tested on the larger datasets; the authors give no reason, though it might be due to scalability issues.<br />
# They did not compare to the CNN system proposed by Abdel-Hamid et al. <ref name=convDNN></ref>.<br />
<br />
= References =<br />
<br />
<references /></div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=scene_Parsing_with_Multiscale_Feature_Learning,_Purity_Trees,_and_Optimal_Covers_Machines&diff=27718scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines2017-08-30T13:46:29Z<p>Conversion script: Conversion script moved page Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines to scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines: Converting page titles to lower...</p>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Farabet, Clement, et al. [http://arxiv.org/pdf/1202.2160v2.pdf "Scene parsing with multiscale feature learning, purity trees, and optimal covers."] arXiv preprint arXiv:1202.2160 (2012).<br />
</ref> presents an approach to full scene labelling (FSL). This is the task of giving a label to each pixel in an image corresponding to which category of object it belongs to. FSL involves solving the problems of detection, segmentation, recognition, and contextual integration simultaneously. One of the main obstacles of FSL is that the information required for labelling a particular pixel could come from very distant pixels as well as their labels. This distance often depends on the particular label as well (e.g. the presence of a wheel might mean there is a vehicle nearby, while an object like the sky or water could span the entire image, and figuring out to which class a particular blue pixel belongs could be challenging).<br />
<br />
= Overview =<br />
<br />
The proposed method for FSL works by first computing a tree of segments from a graph of pixel dissimilarities. A set of dense feature vectors is then computed, encoding regions of multiple sizes centered on each pixel. Feature vectors are aggregated and fed to a classifier which estimates the distribution of object categories in a segment. A subset of tree nodes that cover the image are selected to maximize the average "purity" of the class distributions (i.e. maximizing the likelihood that each segment will contain a single object). The convolutional network feature extractor is trained end-to-end from raw pixels, so there is no need for engineered features.<br />
<br />
There are five main ingredients to this new method for FSL:<br />
<br />
# Trainable, dense, multi-scale feature extraction<br />
# Segmentation tree<br />
# Regionwise feature aggregation<br />
# Class histogram estimation<br />
# Optimal purity cover<br />
<br />
The three main contributions of this paper are:<br />
<br />
# Using a multi-scale convolutional net to learn good features for region classification<br />
# Using a class purity criterion to decide if a segment contains a single object, as opposed to several objects, or part of an object<br />
# An efficient procedure to obtain a cover that optimizes the overall class purity of a segmentation<br />
<br />
= Previous Work =<br />
<br />
Most previous methods of FSL rely on MRFs, CRFs, or other types of graphical models to ensure consistency in the labeling and to account for context. This is typically done using a pre-segmentation into super-pixels or other segment candidates. Features and categories are then extracted from individual segments and combinations of neighboring segments.<br />
<br />
Using trees allows the use of fast inference algorithms based on graph cuts or other methods. In this paper, an innovative method based on finding a set of tree nodes that cover the image while minimizing some criterion is used.<br />
<br />
= Model =<br />
<br />
This model relies on two complementary image representations. In the first representation, the image is seen as a point in a high-dimensional space, and we seek to find a transform <math>f: \mathbb{R}^P \rightarrow \mathbb{R}^Q</math> that maps these images into a space in which each pixel can be assigned a label using a simple linear classifier. In the second representation, the image is seen as an edge-weighted graph, on which a hierarchy of segmentations/clusterings can be constructed. This representation yields a natural abstraction of the original pixel grid, and provides a hierarchy of observation levels for all the objects in the image. The full model is shown in the diagram below. It is an end-to-end trainable model for scene parsing.<br />
<br />
[[File:SceneModelDiagram.png]]<br />
<br />
== Pre-processing ==<br />
<br />
Before being fed into the Convolutional Neural Network (CNN), multiple scaled versions of the image are generated. The set of these scaled images is called a ''pyramid''. Three differently scaled versions of the image were created, in a manner similar to that shown in the picture below.<br />
<br />
[[File:Image_pyramid.png ]]<br />
<br />
The scaling can be done by different transforms; the paper suggests using the Laplacian transform. The Laplacian is the sum of partial second derivatives <math>\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}</math>. A two-dimensional discrete approximation is given by the matrix <math>\left[\begin{array}{ccc}0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0\end{array}\right]</math>.<br />
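The discrete Laplacian above can be applied to an image with a plain convolution (a minimal numpy sketch for illustration, not code from the paper):<br />

```python
import numpy as np

# 3x3 discrete Laplacian kernel (sum of second partial derivatives)
LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=float)

def convolve2d(image, kernel):
    """Naive 'valid' 2D convolution (the kernel is symmetric, so
    convolution and correlation coincide here)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# On a linear intensity ramp the second derivative is zero everywhere
ramp = np.tile(np.arange(5, dtype=float), (5, 1))
print(convolve2d(ramp, LAPLACIAN))  # all zeros
```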
<br />
This first step typically suffers from two main problems: (1) because window sizes differ, an object may not be properly centered and scaled in some windows, which offers a poor observation for predicting the class of the underlying object; (2) integrating a large context increases the dimensionality P of the input, making it necessary to enforce some invariance in the function f itself.<br />
<br />
== Network Architecture ==<br />
<br />
More holistic tasks, such as full-scene understanding (pixel-wise labeling, or any dense feature estimation) require the system to model complex interactions at the scale of complete images, not simply within a patch. In this problem the dimensionality becomes unmanageable: for a typical image of 256×256 pixels, a naive neural network would require millions of parameters, and a naive convolutional network would require filters that are unreasonably large to view enough context. The multiscale convolutional network overcomes these limitations by extending the concept of weight replication to the scale space. The more scales used to jointly train the models, the better the representation becomes for all scales. Using the same function to extract features at each scale is justified because the image content is scale invariant in principle. The authors noted that they observed worse performance when the weight sharing was removed.<br />
<br />
== Post-Processing ==<br />
<br />
In this model the sampling is done using an elastic max-pooling function, which remaps input patterns of arbitrary size into a fixed G×G grid (in this case a 5x5 grid was used). This grid can be seen as a highly invariant representation that encodes spatial relations between an object’s attributes/parts. This representation is denoted O<sub>k</sub> and is shown in the diagram below. With this encoding elongated or ill-shaped objects are nicely handled. The dominant features are also used to represent the object, and when combined with background subtraction, these features represent good basis functions to recognize the underlying object. These features are then associated to the corresponding areas of the tree segmentation of the image (generated by creating a minimum spanning tree from the dissimilarity graph of neighboring pixels) for optimal cover calculation.<br />
<br />
[[File:SceneGridFeatures.png]]<br />
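The remapping of a variable-size region into a fixed G×G grid can be sketched as follows (an illustrative numpy version assuming simple, evenly spread bin edges; the paper's elastic max-pooling may differ in detail):<br />

```python
import numpy as np

def adaptive_max_pool(region, grid=5):
    """Max-pool an arbitrary-size 2D region into a fixed grid x grid output.

    Bin edges are spread (approximately) evenly over the region, so any
    input of shape >= grid x grid maps to the same fixed-size output.
    """
    h, w = region.shape
    rows = np.linspace(0, h, grid + 1, dtype=int)
    cols = np.linspace(0, w, grid + 1, dtype=int)
    out = np.empty((grid, grid))
    for i in range(grid):
        for j in range(grid):
            out[i, j] = region[rows[i]:rows[i + 1], cols[j]:cols[j + 1]].max()
    return out

# Regions of different shapes all map to the same 5x5 representation
print(adaptive_max_pool(np.random.rand(17, 33)).shape)  # (5, 5)
```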
<br />
One of the important features of this model is its method for optimal cover, which is detailed in the diagram below. The leaf nodes represent pixels in the image, and a subset of tree nodes is selected whose aggregate children span the entire image. The nodes are selected to minimize the average "impurity" of the class distribution (i.e. the entropy). The cover attempts to find an overall consistent segmentation, where each selected node corresponds to a particular class labelling for itself and all of its unselected children.<br />
<br />
[[File:SceneOptimalCover.png]]<br />
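The purity criterion can be illustrated with entropy (a hypothetical sketch: a mixed parent segment is compared against the size-weighted entropy of its children, and the cover prefers whichever is purer):<br />

```python
import numpy as np

def entropy(hist):
    """Entropy (impurity) of a class histogram; lower means purer."""
    p = np.asarray(hist, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# A segment containing a single class is perfectly pure ...
assert entropy([10, 0, 0]) == 0.0
# ... while a mixed segment is impure, so the cover would rather select
# its (purer) children than the parent node itself.
parent = entropy([5, 5, 0])  # e.g. half grass, half sky
children = 0.5 * entropy([5, 0, 0]) + 0.5 * entropy([0, 5, 0])
print(parent > children)  # True: the children are purer
```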
<br />
<br />
== Training ==<br />
<br />
Training is done in a two-step process. First, the low-level feature extractor <math>f_s</math> is trained to produce features that are maximally discriminative. Then, the classifier <math>c</math> is trained to predict the distribution of classes in a component. The feature vectors are obtained by concatenating the network outputs for different scales of the multiscale pyramid. To train the feature extractor, the loss function<br />
<math>L_{\mathrm{cat}} = - \sum_{i \in \mathrm{pixels}, a \in \mathrm{classes}} c_{i,a} \ln(\hat{c}_{i,a})</math><br />
is used, where <math>c_i</math> is the true (classification) target vector and <math>\hat{c}_i</math> is the prediction from a linear classifier (which is only used in this step and will be discarded later).<br />
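The loss above can be written directly in numpy (a minimal sketch; here <math>c</math> holds one-hot targets and <math>\hat{c}</math> the predicted class probabilities per pixel):<br />

```python
import numpy as np

def categorical_loss(c, c_hat, eps=1e-12):
    """L_cat = -sum_{i,a} c_{i,a} * ln(c_hat_{i,a}) over pixels i, classes a.
    eps guards against log(0) for numerical safety."""
    return -np.sum(c * np.log(c_hat + eps))

# Two pixels, three classes: confident correct predictions give a small loss
c = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)       # one-hot targets
c_hat = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])  # predictions
print(categorical_loss(c, c_hat))  # -(ln 0.9 + ln 0.8), roughly 0.3285
```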
<br />
After training the parameters of the feature extractor, the parameters of the actual classifier are trained by minimizing the Kullback-Leibler divergence (KL-divergence) between the true distribution of labels in each component and the prediction from the classifier. The KL-divergence is a measure of the difference between two probability distributions.<br />
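For reference, the KL-divergence between a true label distribution p and a predicted distribution q can be computed as follows (the standard definition, not code from the paper):<br />

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_a p_a * ln(p_a / q_a); zero iff p == q.
    Terms with p_a = 0 contribute nothing and are masked out."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # > 0: classifier is off
```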
<br />
= Experiments =<br />
<br />
For all experiments, a 2-stage convolutional network was used. The input is a 3-channel image, and it is transformed into a 16-dimensional feature map, using a bank of 16 7x7 filters followed by tanh units. This feature map is then pooled using a 2x2 max-pooling layer. The second layer transforms the 16-dimensional feature map into a 64-dimensional feature map, with each component being produced by a combination of 8 7x7 filters (for an effective total of 512 filters), followed by tanh units. This map is also pooled using a 2x2 max-pooling layer. This 64-dimensional feature map is transformed into a 256-dimensional feature map by using a combination of 16 7x7 filters (2048 filters).<br />
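The shape bookkeeping through these stages can be checked with a small helper (an illustrative sketch that assumes unpadded 'valid' 7x7 convolutions and 2x2 pooling; the actual system pads its inputs, as noted below):<br />

```python
def conv_then_pool(h, w, k=7, pool=2):
    """Spatial size after a 'valid' k x k convolution then pool x pool max-pooling."""
    h, w = h - k + 1, w - k + 1
    return h // pool, w // pool

# 240 x 320 input: stage 1 -> 16 maps, stage 2 -> 64 maps, stage 3 -> 256 maps
h, w = conv_then_pool(240, 320)      # after stage 1
h, w = conv_then_pool(h, w)          # after stage 2
h, w = conv_then_pool(h, w, pool=1)  # stage 3 has no pooling
print(h, w)
```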
<br />
The network is applied to a locally normalized Laplacian pyramid constructed on the input image. The pyramid contains three rescaled versions of the input: 320x240, 160x120, and 80x60. All of the inputs are properly padded and the outputs of each of the three networks are upsampled and concatenated to produce a 768-dimensional feature vector map (256x3). The network is trained on all three scales in parallel.<br />
<br />
A simple grid search was used to find the best learning rate and regularization parameters (weight decay). A holdout of 10% of the training data was used as a validation set during the parameter search. For both datasets, jitter was used to artificially expand the size of the training data, to try to allow features to not overfit irrelevant biases present in the data. This jitter included horizontal flipping, and rotations between -8 and 8 degrees.<br />
<br />
The hierarchy used to find the optimal cover is constructed on the raw image gradient, based on a standard volume criterion<ref><br />
F. Meyer and L. Najman. [http://onlinelibrary.wiley.com/doi/10.1002/9781118600788.ch9/summary "Segmentation, minimum spanning tree and hierarchies."] In L. Najman and H. Talbot, editors, Mathematical Morphology: from theory to application, chapter 9, pages 229–261. ISTE-Wiley, London, 2010.<br />
</ref><ref><br />
J. Cousty and L. Najman. [http://link.springer.com/chapter/10.1007/978-3-642-21569-8_24 "Incremental algorithm for hierarchical minimum spanning forests and saliency of watershed cuts."] In 10th International Symposium on Mathematical Morphology (ISMM’11), LNCS, 2011.<br />
</ref>, completed by removing non-informative small components (fewer than 100 pixels). Traditionally, segmentation methods use a partition of segments (i.e. finding an optimal cut in the tree) rather than a cover. A number of graph-cut methods were tried, but the results were systematically worse than the optimal cover method.<br />
<br />
Two sampling methods for learning the multiscale features were tried on each dataset. One uses the natural frequencies of each class in the dataset, while the other balances them so that an equal number of examples from each class is shown to the network. The results from each of these methods varied with the dataset used and are reported in the tables below. The authors only included the results for the frequency-balancing method for the Stanford Background dataset, as it consistently gave better results, but it could still be useful to have the results from the other method to help guide future work. Training with balanced frequencies allows better discrimination of small objects, and although it tends to have lower overall pixel-wise accuracy, it performs better from a recognition point of view. This observation can be seen in the tables below. The per-pixel accuracy for frequency balancing on the Barcelona dataset is quite poor, which the authors attribute to the fact that the dataset has a large number of classes with very few training examples, leading to overfitting when trying to model them in this manner.<br />
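Frequency balancing can be sketched by weighting samples inversely to their class frequency (a hypothetical illustration of the idea, not the authors' sampler):<br />

```python
import numpy as np

def balanced_weights(labels):
    """Per-sample sampling weights inversely proportional to class frequency,
    so each class contributes equally in expectation."""
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes, counts))
    w = np.array([1.0 / freq[y] for y in labels])
    return w / w.sum()

# 'sky' dominates the data, yet balancing gives each class half the mass
labels = np.array(['sky'] * 8 + ['car'] * 2)
w = balanced_weights(labels)
print(w[labels == 'sky'].sum(), w[labels == 'car'].sum())  # 0.5 0.5
```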
<br />
= Results =<br />
<br />
[[File:SceneResultTableStanford.png]]<br />
<br />
[[File:SceneResultTableSIFT.png]]<br />
<br />
[[File:SceneResultTableBarcelona.png]]<br />
<br />
[[File:SceneResultPictures.png]]<br />
<br />
=References=<br />
<references /></div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=27720learning Phrase Representations2017-08-30T13:46:29Z<p>Conversion script: Conversion script moved page Learning Phrase Representations to learning Phrase Representations: Converting page titles to lowercase</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes. At each time step t, the hidden state <math>h_{t}</math> of the RNN is updated by:<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
where ''f'' is a non-linear activation function. ''f'' may be as simple as an element-wise logistic sigmoid function or as complex as a long short-term memory (LSTM) unit. After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by, <br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (input sequence, output sequence) pair from the training set. In this case, since the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways: one way is to generate a target sequence given an input sequence; the other is to score a given pair of input and output sequences.<br />
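The encoder's recurrence <math>h_t=f(h_{t-1},x_t)</math> can be sketched with a tanh activation (a minimal numpy illustration with made-up dimensions; the paper's <math>f</math> is the gated unit described below):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 4, 8                      # input and hidden sizes (arbitrary)
W = rng.normal(0, 0.1, (d_h, d_x))   # input-to-hidden weights
U = rng.normal(0, 0.1, (d_h, d_h))   # hidden-to-hidden weights

def encode(xs):
    """Read a variable-length sequence of vectors; the final hidden state
    is the fixed-length summary c of the whole input."""
    h = np.zeros(d_h)
    for x in xs:                     # h_t = f(h_{t-1}, x_t)
        h = np.tanh(W @ x + U @ h)
    return h

c = encode(rng.normal(size=(7, d_x)))  # 7 symbols in, fixed-size summary out
print(c.shape)  # (8,)
```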
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also propose a new type of hidden node that has been inspired by LSTM but is much simpler to compute and implement. Fig. 2 shows the graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, it can be expressed as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector, and <math>\odot</math> denotes element-wise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
Because each hidden unit has separate gates, it is possible for each hidden unit to learn to capture dependencies over different lengths of time (determined by the frequency at which its reset and update gates are active).<br />
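The update equations above can be written down directly (a minimal numpy sketch with arbitrary dimensions; biases are omitted here, as in the equations above):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6  # input/hidden size (arbitrary)
Wr, Ur, Wz, Uz, W, U = (rng.normal(0, 0.1, (d, d)) for _ in range(6))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_step(x, h_prev):
    r = sigmoid(Wr @ x + Ur @ h_prev)            # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev)            # update gate
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))  # candidate state h~
    return z * h_prev + (1 - z) * h_tilde        # h_t: gated mix

h = np.zeros(d)
for x in rng.normal(size=(5, d)):  # run the unit over a short sequence
    h = gated_step(x, h)
print(h.shape)  # (6,)
```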
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes the posterior<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term at the right hand side is called translation model and the latter language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a loglinear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
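In this log-linear framework, adding the RNN Encoder–Decoder score simply means one more (feature, weight) pair (a schematic illustration with made-up feature values):<br />

```python
def loglinear_score(features, weights):
    """Unnormalized log-probability: sum_n w_n * f_n(f, e).
    log Z(e) is constant across candidate translations of the same source,
    so it can be ignored when ranking them."""
    return sum(w * f for w, f in zip(weights, features))

# Features for one candidate: [log TM prob, log LM prob, RNN enc-dec score]
features = [-2.3, -4.1, -1.7]  # hypothetical values
weights = [1.0, 0.6, 0.8]      # tuned, e.g. to maximize BLEU on a dev set
print(loglinear_score(features, weights))  # about -6.12
```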
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the log-linear model shown above when tuning the maximum a posteriori SMT decoder. For training the RNN Encoder–Decoder, phrase frequency is ignored for several reasons: to reduce computation time, to ensure the model does not simply rank phrases by frequency, and because frequency information is already encoded in the features for the SMT (so it is better not to use the capacity of the RNN Encoder–Decoder redundantly).<br />
<br />
=Alternative Models=<br />
The researchers noted a number of other potential translation models and their usability.<br />
<br />
The first model is by Schwenk, and it is an application of a variant of the continuous space language model to the task of machine translation. The model is essentially a feedforward neural network with a common projection for input words encoded as bag-of-words vectors. Schwenk fixed the input and output sentence lengths; for a given length, the neural network estimates the probability of the output sequence of words and scores potential translations. A major disadvantage, however, is that the input and output lengths are fixed, so the model cannot handle variable-length inputs or outputs.<br />
<br />
The model figure<ref><br />
[Schwenk2012] Holger Schwenk. 2012. Continuous<br />
space translation models for phrase-based statistical<br />
machine translation. In Martin Kay and Christian<br />
Boitet, editors, Proceedings of the 24th International<br />
Conference on Computational Linguistics<br />
(COLIN), pages 1071–1080.<br />
</ref>:<br />
<br />
[[File:CONTINUOUS.PNG]]<br />
<br />
Another model, similar to Schwenk's, is by Devlin and a feedforward neural network is also used. Rather than estimating the probability of the entire output sequence of words in Schwenk's model, Devlin only estimates the probability of the next word and uses both a portion of the input sentence and a portion of the output sentence. It reported impressive improvements but similar to Schwenk, it fixes the length of input prior to training.<br />
<br />
Chandar et al. trained a feedforward neural network to learn a mapping from a bag-of-words representation of an input phrase to an output phrase.<ref><br />
Lauly, Stanislas, et al. "An autoencoder approach to learning bilingual word representations." Advances in Neural Information Processing Systems. 2014.<br />
</ref> This is closely related to both the proposed RNN Encoder–Decoder and the model<br />
proposed by Schwenk, except that their input representation of a phrase is a bag-of-words. A similar approach of using bag-of-words representations was proposed by Gao<ref><br />
Gao, Jianfeng, et al. "Learning semantic representations for the phrase translation model." arXiv preprint arXiv:1312.0482 (2013).<br />
</ref> as well. One important difference between the proposed RNN Encoder–Decoder and the above approaches is that the order of the words in source and target phrases is taken into account. The RNN Encoder–Decoder naturally distinguishes between sequences that have the same words but in a different order, whereas the aforementioned approaches effectively ignore order information.<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model Cho et al. used baseline phrase-based SMT system and a Neural Language Model(CSLM)<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
The following model combinations were tested:<br />
# Baseline configuration<br />
# Baseline + RNN<br />
# Baseline + CSLM + RNN<br />
# Baseline + CSLM + RNN + Word penalty<br />
<br />
The results are shown in Figure 3. The RNN encoder-decoder consisted of 1000 hidden units. Rank-100 matrices were used to connect the input to the hidden unit. The "word penalty" attempts to penalize the words unknown to the neural network, which is accomplished by using the number of unknown words as a feature in the log-linear model above. <br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of unknown words to neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not too correlated and that one can expect better results by improving each method independently.<br />
<br />
<br />
== Word and Phrase Representations ==<br />
<br />
As the presented model maps sentences into a continuous space vector and prior continuous space language models have been known to learn semantically meaningful embeddings, one could expect this to happen for the presented model, too. This is indeed the case. When projecting to a 2D space (with Barnes-Hut-SNE), semantically similar words are clearly clustered.<br />
<br />
[[File:Fig4.png]]<br />
<br />
Phrases are also clustered capturing both semantic and syntactic structures.<br />
<br />
[[File:Fig5.png]]<br />
<br />
= References=<br />
<references /></div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Genetics&diff=27713Genetics2017-08-30T13:46:29Z<p>Conversion script: Conversion script moved page Genetics to genetics: Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[genetics]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_of_the_tissue-regulated_splicing_code&diff=27715Deep Learning of the tissue-regulated splicing code2017-08-30T13:46:29Z<p>Conversion script: Conversion script moved page Deep Learning of the tissue-regulated splicing code to deep Learning of the tissue-regulated splicing code: Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[deep Learning of the tissue-regulated splicing code]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Convolutional_Neural_Networks_For_LVCSR&diff=27717Deep Convolutional Neural Networks For LVCSR2017-08-30T13:46:29Z<p>Conversion script: Conversion script moved page Deep Convolutional Neural Networks For LVCSR to deep Convolutional Neural Networks For LVCSR: Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[deep Convolutional Neural Networks For LVCSR]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Scene_Parsing_with_Multiscale_Feature_Learning,_Purity_Trees,_and_Optimal_Covers_Machines&diff=27719Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines2017-08-30T13:46:29Z<p>Conversion script: Conversion script moved page Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines to scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines: Converting page titles to lower...</p>
<hr />
<div>#REDIRECT [[scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_Phrase_Representations&diff=27721Learning Phrase Representations2017-08-30T13:46:29Z<p>Conversion script: Conversion script moved page Learning Phrase Representations to learning Phrase Representations: Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[learning Phrase Representations]]</div>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory_Networks&diff=27709Memory Networks2017-08-30T13:46:28Z<p>Conversion script: Conversion script moved page Memory Networks to memory Networks: Converting page titles to lowercase</p>
<hr />
<div>#REDIRECT [[memory Networks]]</div>Conversion script