learning Long-Range Vision for Autonomous Off-Road Driving
<hr />
<div>= Introduction =<br />
<br />
Stereo vision has been used extensively by mobile robots to identify obstacles in their path, but it is limited by its maximum range of about 12 meters. For the safety of high-speed mobile robots, recognizing obstacles at longer ranges is vital.<br />
<br />
The authors of this paper proposed a "long-range vision system that uses self-supervised learning to train a classifier in real-time" <ref name="hadsell2009">Hadsell, Raia, et al. "Learning long‐range vision for autonomous off‐road driving." Journal of Field Robotics 26.2 (2009): 120-144.</ref> to robustly increase the obstacle and path detection range to over 100 meters. This approach has been implemented and tested on the Learning Applied to Ground Robots (LAGR) vehicle provided by the National Robotics Engineering Center (NREC).<br />
<br />
= Related Work =<br />
<br />
A common approach to vision-based driving is to process images captured from a pair of stereo cameras, produce a point cloud and use various heuristics to build a traversability map <ref name="goldberg2002">Goldberg, Steven B., Mark W. Maimone, and Lany Matthies. "Stereo vision and rover navigation software for planetary exploration." Aerospace Conference Proceedings, 2002. IEEE. Vol. 5. IEEE, 2002.</ref> <ref name="kriegman1989">Kriegman, David J., Ernst Triendl, and Thomas O. Binford. "Stereo vision and navigation in buildings for mobile robots." Robotics and Automation, IEEE Transactions on 5.6 (1989): 792-803.</ref> <ref name="kelly1998">Kelly, Alonzo, and Anthony Stentz. "Stereo vision enhancements for low-cost outdoor autonomous vehicles." Int’l Conf. on Robotics and Automation, Workshop WS-7, Navigation of Outdoor Autonomous Vehicles. Vol. 1. 1998.</ref><br />
. There have been efforts to increase the range of stereo vision by using the color of nearby ground and obstacles, but these color-based improvements can easily be fooled by shadows, monochromatic terrain, and complex obstacle or ground types.<br />
<br />
More recent vision-based approaches such as <ref name="hong2002">Hong, Tsai Hong, et al. "Road detection and tracking for autonomous mobile robots." AeroSense 2002 (2002): 311-319.</ref> <ref name="lieb2005">Lieb, David, Andrew Lookingbill, and Sebastian Thrun. "Adaptive Road Following using Self-Supervised Learning and Reverse Optical Flow." Robotics: Science and Systems. 2005.</ref> <ref name="dahlkamp2006">Dahlkamp, Hendrik, et al. "Self-supervised Monocular Road Detection in Desert Terrain." Robotics: science and systems. 2006.</ref>, which use learning algorithms to map traversability information to color histograms or geometric (point cloud) data, have achieved success in the DARPA Grand Challenge.<br />
<br />
Other, non-vision-based systems have used the near-to-far learning paradigm to classify distant sensor data based on self-supervision from a reliable, close-range sensor. A self-supervised classifier was trained on satellite imagery and ladar sensor data for the Spinner vehicle’s navigation system<ref><br />
Sofman, Boris, et al. "Improving robot navigation through self‐supervised online learning." Journal of Field Robotics 23.11‐12 (2006): 1059-1075.<br />
</ref><br />
and an online self-supervised classifier for a ladar-based navigation system was trained to predict load-bearing surfaces in the presence of vegetation.<ref><br />
Wellington, Carl, and Anthony Stentz. "Online adaptive rough-terrain navigation in vegetation." Robotics and Automation, 2004. Proceedings. ICRA'04. 2004 IEEE International Conference on. Vol. 1. IEEE, 2004.<br />
</ref><br />
<br />
= Challenges =<br />
<br />
* <span>'''Choice of Feature Representation''': The feature representation must be informative enough to discriminate terrain types while remaining robust to irrelevant transformations.</span><br />
* <span>'''Automatic generation of Training labels''': Because the classifier devised is trained in real-time, it requires a constant stream of training data and labels to learn from.</span><br />
* <span>'''Ability to generalize from near to far field''': Objects captured by the camera scale inversely with their distance from the camera, so the system needs to take this into account and normalize the objects detected.</span><br />
<br />
= Overview of the Learning Process =<br />
<br />
[[Image:method.png|frame| center | 400px | alt=|Learning System Proposed by <ref name="hadsell2009" /> ]]<br />
<br />
The learning process described by <ref name="hadsell2009" /> is as follows:<br />
<br />
# <span>'''Pre-Processing and Normalization''': This step corrects the skewed horizon captured by the camera and normalizes the scale of objects, since objects in the image scale inversely with their distance from the camera.</span><br />
# <span>'''Feature Extraction''': Convolutional Neural Networks were trained and used to extract features in order to reduce dimensionality.</span><br />
# <span>'''Stereo Supervisor Module''': A procedure that uses multiple ground-plane estimation, heuristics, and statistical false-obstacle filtering to assign class labels to close-range objects in the normalized input. Its goal is to generate training data for the classifier at the end of the learning process.</span><br />
# <span>'''Training and Classification''': Once the class labels and extracted features are combined, they are fed into the classifier for real-time training. The classifier is trained on every frame; the authors used stochastic gradient descent to update the classifier weights and cross-entropy as the loss function.</span><br />
<br />
== Pre-Processing and Normalization ==<br />
<br />
The first stage of the learning process addresses two issues: first, the horizon is skewed due to the roll of the camera and the terrain; second, since objects in the image scale inversely with their distance from the camera, they must be normalized to represent their true scale.<br />
<br />
[[Image:horizon_pyramid.png|frame| center | 400px | alt=|Horizon Pyramid <ref name="hadsell2009" /> <span data-label="fig:hpyramid"></span>]]<br />
<br />
To solve both issues, a normalized "pyramid" of 7 sub-images is extracted (see figure above), where the top row of the pyramid covers a range from 112 meters to infinity and the closest row covers a range of 4 to 11 meters. These sub-images are extracted and normalized from the input image to form the input for the next stage.<br />
<br />
[[Image:horizon_normalize.png|frame| center | 400px | alt=|Creating target sub-image <ref name="hadsell2009" /> <span data-label="fig:hnorm"></span>]]<br />
<br />
To obtain the scaled and horizon-corrected sub-images, the authors used a combination of a Hough transform and a robust PCA refit to estimate the ground plane <math>P = (p_{r}, p_{c}, p_{d}, p_{o})</math>, where <math>p_{r}</math> is the roll, <math>p_{c}</math> is the column, <math>p_{d}</math> is the disparity and <math>p_{o}</math> is the offset. Once the ground plane <math>P</math> is estimated, the target sub-image with corners <math>A, B, C, D</math> (see figure above) is computed by locating the horizon line <math>\overline{EF}</math> at a stereo disparity of <math>d</math> pixels. The following equations were used to calculate the center of the line <math>M</math>, the endpoints <math>E</math> and <math>F</math>, the rotation <math>\theta</math>, and finally the corner points <math>A, B, C, D</math>.<br />
<br />
<math>\textbf{M}_{y} = \frac{p_{c} \textbf{M}_{x} + p_{d} d + p_{o}}{-p_{r}}</math><br />
<br />
<math>E = \left( \textbf{M}_{x} - \textbf{M}_{x} \cos{\theta},\; \textbf{M}_{y} - \textbf{M}_{y} \sin{\theta} \right)</math><br />
<br />
<math>F = \left( \textbf{M}_{x} + \textbf{M}_{x} \cos{\theta},\; \textbf{M}_{y} + \textbf{M}_{y} \sin{\theta} \right)</math><br />
<br />
<math>\theta = \left( \frac{w\, p_{c} + p_{d} + p_{o}}{-p_{r}} - \frac{p_{d} + p_{o}}{-p_{r}} \right) / w</math><br />
<br />
<math>A = \left( \textbf{E}_{x} + \alpha \sin \theta,\; \textbf{E}_{y} - \alpha \cos \theta \right)</math><br />
<br />
<math>B = \left( \textbf{F}_{x} + \alpha \sin \theta,\; \textbf{F}_{y} - \alpha \cos \theta \right)</math><br />
<br />
<math>C = \left( \textbf{F}_{x} - \alpha \sin \theta,\; \textbf{F}_{y} + \alpha \cos \theta \right)</math><br />
<br />
<math>D = \left( \textbf{E}_{x} - \alpha \sin \theta,\; \textbf{E}_{y} + \alpha \cos \theta \right)</math><br />
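The corner computation above can be sketched in code. This is an interpretive sketch, not the authors' implementation: it assumes <math>\textbf{M}_{x} = w/2</math> (the image center column, so the <math>\textbf{M}_{x}\cos\theta</math> term equals half the image width), treats <math>E</math> and <math>F</math> as the endpoints of a horizon segment of width <math>w</math> through <math>M</math>, and uses the slope <math>\theta</math> directly as a small angle.

```python
import math

def target_subimage_corners(plane, d, w, alpha):
    """Corner points A, B, C, D of the horizon-aligned target sub-image.

    plane: ground-plane parameters (p_r, p_c, p_d, p_o) as in the text.
    d:     stereo disparity (pixels) selecting the sub-image row.
    w:     image width in pixels; alpha: half-height of the sub-image.
    """
    p_r, p_c, p_d, p_o = plane

    # Slope of the horizon: difference of the plane's row coordinate
    # across the image width, divided by that width (equation above).
    theta = ((w * p_c + p_d + p_o) / -p_r - (p_d + p_o) / -p_r) / w

    # Center of the horizon line; M_x = w/2 is an assumption.
    m_x = w / 2.0
    m_y = (p_c * m_x + p_d * d + p_o) / -p_r

    # Endpoints E, F: half the image width to each side of M along
    # the horizon direction (geometric interpretation of the text).
    ex, ey = m_x - (w / 2) * math.cos(theta), m_y - (w / 2) * math.sin(theta)
    fx, fy = m_x + (w / 2) * math.cos(theta), m_y + (w / 2) * math.sin(theta)

    # Offset perpendicular to the line by +/- alpha to get the corners.
    a = (ex + alpha * math.sin(theta), ey - alpha * math.cos(theta))
    b = (fx + alpha * math.sin(theta), fy - alpha * math.cos(theta))
    c = (fx - alpha * math.sin(theta), fy + alpha * math.cos(theta))
    d_pt = (ex - alpha * math.sin(theta), ey + alpha * math.cos(theta))
    return a, b, c, d_pt
```

With a level horizon (<math>p_{c} = 0</math>) the rotation vanishes and the four corners form an axis-aligned strip of height <math>2\alpha</math> centered on the horizon row.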
<br />
Finally, the images are converted from RGB to YUV color space, a common step in image-processing pipelines.<br />
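The paper summary does not give the conversion formula; a per-pixel sketch using the standard BT.601 weights is below (which variant the LAGR software actually used is an assumption):

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel to YUV using BT.601 weights."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma
    u = 0.492 * (b - y)                     # blue-difference chroma
    v = 0.877 * (r - y)                     # red-difference chroma
    return y, u, v
```

Separating luma from chroma this way makes the downstream features less sensitive to brightness changes than raw RGB.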
<br />
== Feature Extraction ==<br />
<br />
The goal of feature extraction is to reduce the input dimensionality and increase the generality of the resulting classifier. Instead of using a hand-tuned feature list, <ref name="hadsell2009" /> took a data-driven approach and trained 4 different feature extractors; this is the only component of the learning process that is trained offline.<br />
<br />
<ul><br />
<li><p>'''Radial Basis Functions (RBF)''': A set of RBFs was learned to form a feature vector by calculating the Euclidean distance between the input window and each of the 100 RBF centers. Each component of the feature vector <math>D</math> has the form:</p><br />
<p><math>D_{i} = \exp(-\beta^{i} || X - K^{i} ||^{2}_{2})</math></p><br />
<p>where <math>\beta^{i}</math> is the inverse variance of the RBF center <math>K^{i}</math>, <math>X</math> is the input window, and <math>K = \{K^{i} | i = 1 \dots n\}</math> is the set of <math>n</math> radial basis centers.</p></li><br />
<li><p>'''Convolutional Neural Network (CNN)''': A standard CNN was used; the architecture consisted of two layers, the first with 20 7x6 filters and the second with 369 6x5 filters. During training, a fully connected hidden layer of 100 neurons with 5 outputs was added as the last layer. Once the network was trained, that last layer was removed, so the resulting CNN outputs a 100-component feature vector. For training, the authors randomly initialized the weights and used stochastic gradient descent for 30 epochs with <math>L^2</math> regularization. The network was trained on 450,000 labeled image patches and tested on 50,000 labeled patches.</p></li><br />
<li><p>'''Supervised and Unsupervised Auto-Encoders''': Auto-encoders, or deep belief networks <ref name="hinton2006">Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief nets." Neural computation 18.7 (2006): 1527-1554.</ref> <ref name="ranzato2007">Ranzato, Marc Aurelio, et al. "Unsupervised learning of invariant feature hierarchies with applications to object recognition." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007.</ref>, are trained with a layer-wise procedure. The deep belief net trained here has 3 layers, where the first and third are convolutional layers and the second is a max-pooling layer; the architecture is shown in the figure below. Since the encoder contains a max-pooling layer, the decoder should contain an unpooling layer, but the authors did not specify which unpooling technique they used.</p><br />
[[Image:convo_arch.png|frame| center | 400px | alt=|Convolution Neural Network <ref name="hadsell2009" /> <span data-label="fig:convoarch"></span>]]<br />
<br />
<p>For training, the loss function is the mean squared error between the original input and the decoded picture. First the network is trained on 10,000 unlabeled images (unsupervised training) drawn from 150 varied outdoor settings; then the network is fine-tuned on a labeled dataset (supervised training). The authors did not mention how large the labeled dataset was or what training parameters were used for the supervised stage.</p></li></ul><br />
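The RBF feature computation above is simple enough to sketch directly. This is a minimal sketch; the procedure for learning the centers <math>K</math> and inverse variances <math>\beta</math> is not reproduced here.

```python
import numpy as np

def rbf_features(x, centers, betas):
    """RBF feature vector with components D_i = exp(-beta_i * ||x - K_i||^2).

    x:       flattened input window, shape (d,)
    centers: radial basis centers K, shape (n, d)
    betas:   inverse variances beta, shape (n,)
    """
    sq_dists = np.sum((centers - x) ** 2, axis=1)  # ||x - K_i||^2 for each i
    return np.exp(-betas * sq_dists)
```

In the paper's setting <math>n = 100</math>, so each input window is reduced to a 100-component vector regardless of the window's pixel dimensionality.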
<br />
== Stereo Supervisor Module ==<br />
<br />
[[Image:ground_plane_estimation.png|frame| center | 400px | alt=|Ground Plane Estimation <ref name="hadsell2009" /> <span data-label="fig:gplanes"></span>]]<br />
<br />
Once the images have been preprocessed and normalized, stereo vision algorithms are used to produce data samples and labels that are “visually consistent, error free and well distributed”. There are 4 steps at this stage:<br />
<br />
<ol><br />
<li><p><span>'''3D point cloud''': First, a 3D point cloud is produced using the Triclops stereo vision algorithm from Point Grey Research. The algorithm has a range of 12 to 15 meters and works by triangulating corresponding points between the two images to find depth.</span></p></li><br />
<li><p>'''Estimation of ground plane''': Second, a ground-plane model is found using a combination of a Hough transform and principal component analysis (PCA) to fit a plane to the point cloud <math>S = \{ (x^{i}, y^{i}, z^{i}) | i = 1 \dots n \} </math>, where <math>(x^{i}, y^{i}, z^{i})</math> defines the position of a point relative to the robot's center and <math>n</math> is the number of points in the point cloud.</p><br />
<p>The rationale for the Hough transform is that multiple candidate ground planes can be found (see figure above), so a voting system is introduced whereby the parameter vector denoting the ground-plane parameters (such as pitch, roll and offset) with the most votes is used. It is selected by the following equation:</p><br />
<p><math>X = P_{ijk}, \quad (i, j, k) = \operatorname{argmax}_{i,j,k} V_{ijk}</math></p><br />
<p>Where <math>X</math> is the new plane estimate, <math>V</math> is a tensor that accumulates the votes and <math>P</math> is a tensor that records the plane parameter space. Then PCA is used to refit and compute the eigenvalue decomposition of the covariance matrix of the points <math>X^{1 \dots n}</math>.</p><br />
<p><math>\frac{1}{n} \sum^{n}_{i=1} X^{i} (X^{i})^{T} = Q \Lambda Q^{T}</math></p><br />
<p>It should be noted, however, that using multiple ground planes does not eliminate all errors from the labeling process. The authors therefore applied additional heuristics to minimize the errors in the training data <ref name="hadsell2009" />.</p></li><br />
<br />
<li><p><span>'''Projection''': Stereo vision can only robustly detect short-range objects (12 m max). To mitigate the uncertainty of long-range objects, footlines of obstacles (the bottom outline of an obstacle) are used; these give better estimates of the scale and distance of long-range objects. The footlines of long-range objects are found by projecting obstacle points onto the ground planes and marking high point-density regions.</span></p></li><br />
<li><p>'''Labeling''': Once the ground plane estimation, footline projections and obstacle points are found, ground map <math>G</math>, footline-map <math>F</math> and obstacle-map <math>O</math> can be produced.</p><br />
<p>Conventionally, binary classifiers are used for terrain traversability; the authors instead used a classifier with 5 labels:</p><br />
<ul><br />
<li><p><span>Super-traversable</span></p></li><br />
<li><p><span>Ground</span></p></li><br />
<li><p><span>Footline</span></p></li><br />
<li><p><span>Obstacle</span></p></li><br />
<li><p><span>Super-obstacle</span></p></li></ul><br />
<br />
[[Image:label_categories.png|frame| center | 400px | alt=|Label Categories <ref name="hadsell2009" /> <span data-label="fig:labelcategories"></span>]]<br />
<br />
<p>Super-traversable and super-obstacle are high-confidence labels for input windows that contain only ground or only obstacle points. The lower-confidence labels ground and obstacle are used when there is a mixture of points in the input window. Lastly, footline labels are assigned when footline points are centered in the middle of the input window. The label criteria rules used by <ref name="hadsell2009" /> are outlined in the figure below.</p><br />
[[Image:label_criteria.png|frame| center | 400px | alt=|Label Criteria Rules <ref name="hadsell2009" /> <span data-label="fig:labelcriteria"></span>]]<br />
</li></ol><br />
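The PCA refit in step 2 above can be sketched as follows: once the Hough vote selects a coarse plane, the plane is refit by taking the eigendecomposition of the point covariance, with the eigenvector of the smallest eigenvalue as the plane normal. This is a minimal sketch of only the refit step (inlier selection around the winning Hough plane is omitted), not the authors' implementation.

```python
import numpy as np

def pca_refit_plane(points):
    """Refit a plane to a 3D point cloud by PCA.

    points: array of shape (n, 3), rows (x, y, z) relative to the robot.
    Returns (normal, offset) with ||normal|| = 1, such that
    normal . p = offset for points p on the plane.
    """
    centroid = points.mean(axis=0)
    centered = points - centroid
    # Eigendecomposition of the covariance (1/n) sum X_i X_i^T,
    # as in the equation above (computed on centered points).
    cov = centered.T @ centered / len(points)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    normal = eigvecs[:, 0]                   # direction of least variance
    return normal, normal @ centroid
```

The least-variance direction is perpendicular to the dominant planar structure, which is why the smallest-eigenvalue eigenvector serves as the ground-plane normal.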
<br />
== Training and Classification ==<br />
<br />
The real-time classifier is the last stage of the learning process. Due to its real-time nature the classifier has to be simple and efficient, so 5 logistic regression classifiers (one per category) were trained with a Kullback-Leibler divergence (relative entropy) loss function and stochastic gradient descent. Additionally, 5 ring (circular) buffers are used to store incoming data from the feature extraction and stereo supervisor modules. Each ring buffer acts as a first-in, first-out (FIFO) queue, holding data temporarily as it is received and processed. The classifiers output a 5-component likelihood vector for each input.<br />
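A minimal sketch of such an online classifier bank is below. The buffer size, learning rate, and one-vs-rest training scheme are illustrative assumptions rather than the authors' values; the sigmoid cross-entropy gradient used here corresponds to the relative-entropy loss mentioned above.

```python
import numpy as np
from collections import deque

class OnlineLogisticBank:
    """One logistic regression per label (5 in the paper), trained by SGD
    on samples held in per-class FIFO ring buffers."""

    def __init__(self, n_features, n_classes=5, buffer_size=500, lr=0.01):
        self.w = np.zeros((n_classes, n_features))
        self.b = np.zeros(n_classes)
        self.lr = lr
        # One ring buffer of feature vectors per class; deque(maxlen=...)
        # discards the oldest sample once full, giving FIFO behavior.
        self.buffers = [deque(maxlen=buffer_size) for _ in range(n_classes)]

    def add_sample(self, features, label):
        self.buffers[label].append(np.asarray(features, dtype=float))

    def train_step(self):
        """One SGD pass: each class's buffer supplies positives for that
        logistic unit and negatives for the others (one-vs-rest)."""
        for c, buf in enumerate(self.buffers):
            for x in buf:
                for k in range(len(self.w)):
                    target = 1.0 if k == c else 0.0
                    p = 1.0 / (1.0 + np.exp(-(self.w[k] @ x + self.b[k])))
                    # Gradient of the cross-entropy loss for a sigmoid unit.
                    self.w[k] -= self.lr * (p - target) * x
                    self.b[k] -= self.lr * (p - target)

    def predict(self, features):
        """Likelihood vector (one sigmoid per class) for one input window."""
        z = self.w @ np.asarray(features, dtype=float) + self.b
        return 1.0 / (1.0 + np.exp(-z))
```

Because each unit is an independent sigmoid, the output is a per-class likelihood vector rather than a normalized distribution, matching the "5 logistic regression classifiers" description.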
<br />
= Experimental Results =<br />
<br />
== Performances of Feature Extractors ==<br />
<br />
[[Image:feature_extractors.png|frame| center | 400px | alt=|Comparison of Feature Extractors <ref name="hadsell2009" /> <span data-label="fig:featureextractors"></span>]]<br />
<br />
To test the feature extractors, a dataset of 160 hand-labeled frames from over 25 log files was used; the log files can be further divided into 7 groups, as seen in the figure above, which compares the 4 feature extractors: radial basis functions, a convolutional neural network, an unsupervised auto-encoder, and a supervised auto-encoder. In almost all cases the best feature extractor was the CNN trained with auto-encoders, with the best average error rate of <math>8.46\%</math>.<br />
<br />
== Performances of Stereo Supervisor Module ==<br />
<br />
[[Image:stereo_module_comparison.png|frame| center | 400px | alt=|Stereo Module Performance <ref name="hadsell2009" /> <span data-label="fig:stereomodulecomparison"></span>]]<br />
<br />
To test the stereo supervisor module, it was compared against the online classifier on the same ground-truth dataset used in the previous section. As the figure above shows, the online classifier outperforms the stereo supervisor module; the authors attribute this to the online classifier's ability to smooth and regularize the noisy data <ref name="hadsell2009" />.<br />
<br />
== Field Test ==<br />
<br />
The online classifier was deployed on a Learning Applied to Ground Robots (LAGR) vehicle provided by the National Robotics Engineering Center (NREC) and tested on three different courses. The system runs 2 processes simultaneously: the 1-2 Hz online classifier outlined above, and a fast 8-10 Hz stereo-based obstacle-avoidance module. The combination provides good long-range and short-range obstacle-avoidance capabilities.<br />
<br />
The system was found to be most effective when the long-range online classifier was combined with the short-range module; since the short-range module alone has a range of only about 5 meters, it often required human intervention to rescue the vehicle. No quantitative comparisons were given for these field tests; the evaluation is purely subjective and was conducted only during daytime.<br />
<br />
= Conclusion =<br />
<br />
This paper did not introduce novel deep learning methods per se; however, the application of deep learning methods (CNN + auto-encoders), together with a stereo supervisor module, to train a 5-label classifier shows great promise, and increasing the road-classification range from a maximum of 10-12 meters with pure stereo vision to over 100 meters was new in 2009 <ref name="hadsell2009" />.<br />
<br />
There were several issues with the experiments I have observed:<br />
<br />
* <span>There was no mention of how many times the feature extractors were trained to obtain the best parameters, nor of the difficulty of training.</span><br />
* <span>All data and tests were collected during daytime; there is no mention of limitations at night.</span><br />
* <span>Other than stereo-vision-based systems, this paper did not compare itself against other state-of-the-art systems such as <ref name="hong2002" /> <ref name="lieb2005" /> <ref name="dahlkamp2006" />.</span><br />
* <span>The plot of stereo vision vs. the online classifier did not contain error bars. Also, on the x-axis the ground-truth frames are ordered by error difference; it would be interesting to see the frames ordered by time instead, and whether stereo vision performs well at the beginning but poorly afterwards, which would support the authors' claim that an online classifier can smooth and regularize the noisy data.</span><br />
* <span>The field tests lacked quantitative measures to compare the long-range system against the short-range system.</span><br />
<br />
= References =<br />
<references /></div>
<hr />
<div>= Introduction =<br />
<br />
Stereo-vision has been used extensively for mobile robots in identifying near-to-far obstacles in its path, but is limited by it's max range of 12 meters. For the safety of high speed mobile robots recognizing obstacles at longer ranges is vital.<br />
<br />
The authors of this paper proposed a "long-range vision vision system that uses self-supervised learning to train a classifier in real-time" <ref name="hadsell2009">Hadsell, Raia, et al. "Learning long‐range vision for autonomous off‐road driving." Journal of Field Robotics 26.2 (2009): 120-144.</ref>; to robustly increase the obstacle and path detection range to over 100 meters. This approach has been implemented and tested on the Learning Applied to Ground Robots (LAGR) provided by the National Robotics Engineering Center (NREC).<br />
<br />
= Related Work =<br />
<br />
A common approach to vision-based driving is to process images captured from a pair of stereo cameras, produce a point cloud and use various heuristics to build a traversability map <ref name="goldberg2002">Goldberg, Steven B., Mark W. Maimone, and Lany Matthies. "Stereo vision and rover navigation software for planetary exploration." Aerospace Conference Proceedings, 2002. IEEE. Vol. 5. IEEE, 2002.</ref> <ref name="kriegman1989">Kriegman, David J., Ernst Triendl, and Thomas O. Binford. "Stereo vision and navigation in buildings for mobile robots." Robotics and Automation, IEEE Transactions on 5.6 (1989): 792-803.</ref> <ref name="kelly1998">Kelly, Alonzo, and Anthony Stentz. "Stereo vision enhancements for low-cost outdoor autonomous vehicles." Int’l Conf. on Robotics and Automation, Workshop WS-7, Navigation of Outdoor Autonomous Vehicles. Vol. 1. 1998.</ref><br />
, there has been efforts to increase the range of stereo vision by using the color of nearby ground and obstacles, but these color based improvements can easily be fooled by shadows, monochromatic terrain and complex obstacles or ground types.<br />
<br />
More recent vision based approaches such as <ref name="hong2002">Hong, Tsai Hong, et al. "Road detection and tracking for autonomous mobile robots." AeroSense 2002 (2002): 311-319.</ref> <ref name="lieb2005">Lieb, David, Andrew Lookingbill, and Sebastian Thrun. "Adaptive Road Following using Self-Supervised Learning and Reverse Optical Flow." Robotics: Science and Systems. 2005.</ref> <ref name="dahlkamp2006">Dahlkamp, Hendrik, et al. "Self-supervised Monocular Road Detection in Desert Terrain." Robotics: science and systems. 2006.</ref> use learning algorithms to map traversability information to color histograms or geometric (point cloud) data has achieved success in the DARPA challenge.<br />
<br />
Other, non-vision-based systems have used the near-to-far learning paradigm to classify distant sensor data based on self-supervision from a reliable, close-range sensor. A self-supervised classifier was trained on satellite imagery and ladar sensor data for the Spinner vehicle’s navigation system<ref><br />
Sofman, Boris, et al. "Improving robot navigation through self‐supervised online learning." Journal of Field Robotics 23.11‐12 (2006): 1059-1075.<br />
</ref><br />
and an online self-supervised classifier for a ladar-based navigation system was trained to predict load-bearing surfaces in the presence of vegetation.<ref><br />
Wellington, Carl, and Anthony Stentz. "Online adaptive rough-terrain navigation vegetation." Robotics and Automation, 2004. Proceedings. ICRA'04. 2004 IEEE International Conference on. Vol. 1. IEEE, 2004.<br />
</ref><br />
<br />
= Challenges =<br />
<br />
* <span>'''Choice of Feature Representation''': For example how to choose a robust feature representation that is informative enough that avoid irrelevant transformations</span><br />
* <span>'''Automatic generation of Training labels''': Because the classifier devised is trained in real-time, it requires a constant stream of training data and labels to learn from.</span><br />
* <span>'''Ability to generalize from near to far field''': Objects captured by the camera scales inversely proportional to the distance away from the camera, therefore the system needs to take this into account and normalize the objects detected.</span><br />
<br />
= Overview of the Learning Process =<br />
<br />
[[Image:method.png|frame| center | 400px | alt=|Learning System Proposed by <ref name="hadsell2009" /> ]]<br />
<br />
The learning process described by is as follows:<br />
<br />
# <span>'''Pre-Processing and Normalization''': This step involves correcting the skewed horizon captured by the camera and normalizing the scale of objects captured by the camera, since objects captured scales inversely proportional to distance away from camera.</span><br />
# <span>'''Feature Extraction''': Convolutional Neural Networks were trained and used to extract features in order to reduce dimensionality.</span><br />
# <span>'''Stereo Supervisor Module''': Complicated procedure that uses multiple ground plane estimation, heuristics and statistical false obstacle filtering to generate class labels to close range objects in the normalized input. The goal is to generate training data for the classifier at the end of this learning process.</span><br />
# <span>'''Training and Classification''': Once the class labels and feature extraction training data is combined, it is fed into the classifier for real-time training. The classifier is trained on every frame and the authors have used stochastic gradient descent to update the classifier weights and cross entropy as the loss function.</span><br />
<br />
== Pre-Processing and Normalization ==<br />
<br />
At the first stage of the learning process there are two issues that needs addressing, namely the skewed horizon due to the roll of camera and terrain, secondly the true scale of objects that appear in the input image, since objects scale inversely proportional to distance away from camera, the objects need to be normalized to represent its true scale.<br />
<br />
[[Image:horizon_pyramid.png|frame| center | 400px | alt=|Horizon Pyramid <ref name="hadsell2009" /> <span data-label="fig:hpyramid"></span>]]<br />
<br />
To solve both issues, a normalized “pyramid” containing 7 sub-images are extracted (see figure above), where the top row of the pyramid has a range from 112 meters to infinity and the closest pyramid row has a range of 4 to 11 meters. These pyramid sub images are extracted and normalized from the input image to form the input for the next stage.<br />
<br />
[[Image:horizon_normalize.png|frame| center | 400px | alt=|Creating target sub-image <ref name="hadsell2009" /> <span data-label="fig:hnorm"></span>]]<br />
<br />
To obtain the scaled and horizon corrected sub images the authors have used a combination of a Hough transform and PCA robust refit to estimate the ground plane <math>P = (p_{r}, p_{c}, p_{d}, p_{o})</math>. Where <math>p_{r}</math> is the roll, <math>p_{c}</math> is the column, <math>p_{d}</math> is the disparity and <math>p_{o}</math> is the offset. Once the ground plane <math>P</math> is estimated, the horizon target sub-image <math>A, B, C, D</math> (see figure above) is computed by calculating the plane <math>\overline{EF}</math> with stereo disparity of <math>d</math> pixels. The following equations were used to calculate the center of the line <math>M</math>, the plane <math>\overline{EF}</math>, rotation <math>\theta</math> and finally points <math>A, B, C, D</math>.<br />
<br />
<math>\textbf{M}_{y} = \frac{p_{c} \textbf{M}_{x} + p_{d} d + p-{o}}{-p_{r}}</math><br />
<br />
<math>E = (<br />
\textbf{M}_{x} - \textbf{M}_{x} \cos{\theta},<br />
\textbf{M}_{y} - \textbf{M}_{y} \sin{\theta},<br />
)</math><br />
<br />
<math>F = (<br />
\textbf{M}_{x} + \textbf{M}_{x} \cos{\theta},<br />
\textbf{M}_{y} + \textbf{M}_{y} \sin{\theta},<br />
)</math><br />
<br />
<math>\theta = \left( <br />
\frac{\textbf{w}_{pc} + p_{d} + p_{o}}{-p_{r}}<br />
- \frac{p_{d} + p_{o}}{-p_{r}} / w<br />
\right)</math><br />
<br />
<math>A = (<br />
\textbf{E}_{x} + \alpha \sin \theta,<br />
\textbf{E}_{y} - \alpha \cos \theta,<br />
)</math><br />
<br />
<math>B = (<br />
\textbf{F}_{x} + \alpha \sin \theta,<br />
\textbf{F}_{y} - \alpha \cos \theta,<br />
)</math><br />
<br />
<math>C = (<br />
\textbf{F}_{x} - \alpha \sin \theta,<br />
\textbf{F}_{y} + \alpha \cos \theta,<br />
)</math><br />
<br />
<math>D = (<br />
\textbf{E}_{x} - \alpha \sin \theta,<br />
\textbf{E}_{y} + \alpha \cos \theta,invariance<br />
)</math><br />
<br />
The last step of this stage is that the images were converted from RGB to YUV, common in image processing pipelines.<br />
<br />
== Feature Extraction ==<br />
<br />
The goal of the feature extraction is to reduce the input dimensionality and increase the generality of the resulting classifier to be trained. Instead of using hand-tuned feature list, <ref name="hadsell2009" /> used a data driven approach and trained 4 different feature extractors, this is the only component of the learning process where it is trained off-line.<br />
<br />
<ul><br />
<li><p>'''Radial Basis Functions (RBF)''': A set of RBF were learned to form a feature vector by calculating the Euclidean distance between input window and each of the 100 RBF centers. Where each feature vector <math>D</math> has the form:</p><br />
<p><math>D_{j} = exp(-\beta^{i} || X - K^{i} ||^{2}_{2})</math></p><br />
<p>Where <math>\beta^{i}</math> is the inverse variance of the RBF center <math>K^{i}</math>, <math>X</math> is the input window, <math>K</math> is the set of <math>n</math> radial basis centers <math>K = \{K^{i} | i = 1 \dots n\}</math>.</p></li><br />
<li><p>'''Convolution Neural Network (CNN)''': A standard CNN was used, the architecture consisted of two layers, the first has 20 7x6 filters and the second has 369 6x5 filters. During training a 100 fully connected hidden neuron layer is added as a last layer to train with 5 outputs. Once the network is trained however that last layer was removed, and thus the resulting CNN outputs a 100 component feature vector. For training the authors random initialized the weights, used stochastic graident decent for 30 epochs, and <math>L^2</math> regularization. The network was trained against 450,000 labeled image patches, and tested against 50,000 labeled patches.</p></li><br />
<li><p>'''Supervised and Unsupervised Auto-Encoders''': Auto-Encoders or Deep Belief Networks <ref name="hinton2006">Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief nets." Neural computation 18.7 (2006): 1527-1554.</ref> <ref name="ranzato2007">Ranzato, Marc Aurelio, et al. "Unsupervised learning of invariant feature hierarchies with applications to object recognition." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007.</ref> is a layer wise training procedure. The deep belief net trained has 3 layers, where the first and third are convolutional layers, and the second one is a maxpool layer, the architecture is explained in figure below. Since the encoder contains a maxpool layer, the decoder should have an unpool layer, but the author didn't specify which kind of unpool technique they use.</p><br />
[[Image:convo_arch.png|frame| center | 400px | alt=|Convolution Neural Network <ref name="hadsell2009" /> <span data-label="fig:convoarch"></span>]]<br />
<br />
<p>For training, the loss function is the mean squared error between the original input and the decoded image. The network is first trained on 10,000 unlabeled images (unsupervised training) drawn from roughly 150 varying outdoor settings, then fine-tuned on a labeled dataset (supervised training); the authors did not mention how large the labeled dataset was or which training parameters were used for the supervised stage.</p></li></ul><br />
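As a concrete illustration, the RBF feature computation described above can be sketched in NumPy. The centers, inverse variances and the 75-dimensional window size below are random placeholders, not the values learned in the paper:

```python
import numpy as np

def rbf_features(X, K, beta):
    """Map a flattened input window X to an n-component RBF feature vector.

    Component i is exp(-beta_i * ||X - K_i||^2), one per RBF center.
    """
    sq_dists = np.sum((K - X) ** 2, axis=1)   # ||X - K_i||^2 for each center
    return np.exp(-beta * sq_dists)

# Placeholder centers and inverse variances; the paper learns 100 centers.
rng = np.random.default_rng(0)
K = rng.normal(size=(100, 75))        # 100 centers for 75-dim input windows
beta = np.full(100, 0.01)             # inverse variance of each center
X = rng.normal(size=75)               # one flattened input window
D = rbf_features(X, K, beta)          # 100-component feature vector
```

Each component lies in (0, 1], approaching 1 when the window is close to the corresponding center.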
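Likewise, the core operation of each CNN layer is a "valid" 2D convolution (implemented, as is common, as a cross-correlation). A minimal single-filter sketch; the 7x6 kernel size follows the paper's first layer, but the 20x15 input size here is arbitrary:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation: slide the kernel over the image
    without padding, so an HxW image and a KhxKw kernel yield an
    (H-Kh+1) x (W-Kw+1) feature map."""
    H, W = image.shape
    Kh, Kw = kernel.shape
    out = np.empty((H - Kh + 1, W - Kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + Kh, j:j + Kw] * kernel)
    return out

img = np.arange(20 * 15, dtype=float).reshape(20, 15)
fmap = conv2d_valid(img, np.ones((7, 6)) / 42.0)   # one 7x6 averaging filter
# fmap.shape == (14, 10)
```

A real layer applies many such filters (20 in the first layer here) and a nonlinearity, producing one feature map per filter.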
<br />
== Stereo Supervisor Module ==<br />
<br />
[[Image:ground_plane_estimation.png|frame| center | 400px | alt=|Ground Plane Estimation <ref name="hadsell2009" /> <span data-label="fig:gplanes"></span>]]<br />
<br />
Once the images have been preprocessed and normalized, stereo vision algorithms are used to produce data samples and labels that are “visually consistent, error free and well distributed”. There are 4 steps at this stage:<br />
<br />
<ol><br />
<li><p><span>'''3D point cloud''': First, a 3D point cloud is produced using the Triclops stereo vision algorithm from Point Grey Research. The algorithm has a range of 12 to 15 meters and works by triangulating corresponding points between the two images to recover depth.</span></p></li><br />
<li><p>'''Estimation of ground plane''': Second, a ground plane model is found using a combination of the Hough transform and principal component analysis (PCA) to fit a plane to the point cloud <math>S = \{ (x^{i}, y^{i}, z^{i}) | i = 1 \dots n \} </math>, where <math>(x^{i}, y^{i}, z^{i})</math> is the position of a point relative to the robot’s center and <math>n</math> is the number of points in the point cloud.</p><br />
<p>The rationale behind using the Hough transform is that multiple candidate ground planes can be found (see figure above), so a voting system is introduced whereby the parameter vector describing a ground plane (pitch, roll and offset) that receives the most votes is selected, according to the following equation:</p><br />
<p><math>X = P_{ijk}, \quad (i, j, k) = \operatorname{argmax}_{i,j,k} V_{ijk}</math></p><br />
<p>Where <math>X</math> is the new plane estimate, <math>V</math> is a tensor that accumulates the votes and <math>P</math> is a tensor that records the plane parameter space. PCA is then used to refit the plane by computing the eigenvalue decomposition of the covariance matrix of the points <math>X^{1 \dots n}</math>:</p><br />
<p><math>\frac{1}{n} \sum^{n}_{i=1} X^{i} (X^{i})^{\top} = Q \Lambda Q^{\top}</math></p><br />
<p>It should be noted, however, that multiple ground planes do not eliminate all errors from the labeling process. The authors used the following additional heuristics, quoted from the paper, to minimize errors in the training data:</p><br />
<br />
<blockquote><br />
{{Quote|text=}} <ref name="hadsell2009" /><br />
</blockquote></li><br />
<br />
<li><p><span>'''Projection''': Stereo vision has the limitation of only being able to robustly detect short-range (12 m max) objects. To mitigate the uncertainty of long-range objects, the footlines of obstacles (the bottom outline of each obstacle) are used, giving stereo vision better estimates of the scale and distance of long-range objects. The footlines of long-range objects are found by projecting obstacle points onto the ground planes and marking high point-density regions.</span></p></li><br />
<li><p>'''Labeling''': Once the ground plane estimation, footline projections and obstacle points are found, ground map <math>G</math>, footline-map <math>F</math> and obstacle-map <math>O</math> can be produced.</p><br />
<p>Conventionally, binary classifiers are used for terrain traversability; the authors, however, used a classifier with 5 labels:</p><br />
<ul><br />
<li><p><span>Super-traversable</span></p></li><br />
<li><p><span>Ground</span></p></li><br />
<li><p><span>Footline</span></p></li><br />
<li><p><span>Obstacle</span></p></li><br />
<li><p><span>Super-obstacle</span></p></li></ul><br />
<br />
[[Image:label_categories.png|frame| center | 400px | alt=|Label Categories <ref name="hadsell2009" /> <span data-label="fig:labelcategories"></span>]]<br />
<br />
<p>Here, super-traversable and super-obstacle are high-confidence labels that refer to input windows where only ground or only obstacles are seen. The lower-confidence labels, ground and obstacle, are used when there is a mixture of points in the input window. Lastly, footline labels are assigned when footline points are centered in the middle of the input window. The label criteria rules used by <ref name="hadsell2009" /> are outlined in the figure below.</p><br />
[[Image:label_criteria.png|frame| center | 400px | alt=|Label Criteria Rules <ref name="hadsell2009" /> <span data-label="fig:labelcriteria"></span>]]<br />
</li></ol><br />
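For step 1 above, the triangulation that underlies stereo depth can be sketched as follows; the focal length and baseline are illustrative values, not the LAGR cameras' actual calibration:

```python
def stereo_depth(disparity_px, focal_px, baseline_m):
    """Depth from stereo disparity: Z = f * B / d.

    A point seen d pixels apart in the two rectified images of a stereo
    pair with focal length f (pixels) and baseline B (meters) lies at
    depth Z (meters). Small disparities give large, noisy depths, which
    is why stereo range is limited to roughly 12-15 m here.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: f = 600 px, B = 0.12 m
print(stereo_depth(6.0, 600.0, 0.12))   # 12.0 m at 6 px disparity
print(stereo_depth(0.5, 600.0, 0.12))   # 144.0 m: tiny disparity, unreliable
```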
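For step 2, the PCA refit amounts to a least-squares plane fit: the plane normal is the eigenvector of the points' covariance matrix with the smallest eigenvalue. A sketch of that refit step alone (not the paper's Hough voting):

```python
import numpy as np

def fit_plane_pca(points):
    """Least-squares plane through a point cloud via PCA.

    Returns (centroid, unit normal): the normal is the eigenvector of
    the covariance matrix with the smallest eigenvalue, i.e. the
    direction of least variance."""
    centroid = points.mean(axis=0)
    X = points - centroid
    cov = X.T @ X / len(points)             # (1/n) sum X^i (X^i)^T
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    normal = eigvecs[:, 0]                  # smallest-variance direction
    return centroid, normal

# Noisy samples of the plane z = 0.1 x + 0.2 y
rng = np.random.default_rng(0)
xy = rng.uniform(-5, 5, size=(500, 2))
z = 0.1 * xy[:, 0] + 0.2 * xy[:, 1] + rng.normal(scale=0.01, size=500)
pts = np.column_stack([xy, z])
c, n = fit_plane_pca(pts)   # n is (up to sign) the plane's unit normal
```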
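For step 4, the label assignment can be sketched as a simple decision rule over the per-window point counts. The thresholds below are made up for illustration; the paper's actual criteria are those in the label criteria figure:

```python
def label_window(n_ground, n_obstacle, footline_centered, pure_thresh=0.95):
    """Assign one of the 5 categories to an input window from its
    stereo-labeled point counts. The 0.95 purity threshold is
    illustrative, not the paper's actual criterion."""
    if footline_centered:
        return "footline"                 # footline points centered in window
    total = n_ground + n_obstacle
    if total == 0:
        return None                       # no stereo evidence; skip window
    if n_ground / total >= pure_thresh:
        return "super-traversable"        # high confidence: only ground seen
    if n_obstacle / total >= pure_thresh:
        return "super-obstacle"           # high confidence: only obstacles
    # mixed windows get the lower-confidence labels
    return "ground" if n_ground >= n_obstacle else "obstacle"

print(label_window(100, 2, False))   # super-traversable
print(label_window(40, 60, False))   # obstacle
```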
<br />
== Training and Classification ==<br />
<br />
The real-time classifier is the last stage of the learning process. Due to its real-time nature, the classifier has to be simple and efficient, so 5 logistic regression classifiers (one for each category) were used, trained with a Kullback–Leibler divergence (relative entropy) loss function and stochastic gradient descent. Additionally, 5 ring (circular) buffers are used to store incoming data from the feature extraction and stereo supervisor modules; each ring buffer acts as a First-In First-Out (FIFO) queue, holding temporary data as it is received and processed. The result is that the classifier outputs a 5-component likelihood vector for each input.<br />
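A minimal sketch of this stage: FIFO ring buffers of fixed capacity feed SGD updates to a classifier over the 5 categories (for one-hot labels, the KL divergence between the label and the predicted distribution reduces to the cross-entropy). For brevity the 5 one-vs-all logistic regressions are replaced by a single softmax, and the buffer size, learning rate and simulated data are assumptions, not the paper's settings:

```python
import numpy as np
from collections import deque

N_CLASSES, N_FEATURES = 5, 100
W = np.zeros((N_CLASSES, N_FEATURES))   # one weight row per category
b = np.zeros(N_CLASSES)
buffers = [deque(maxlen=64) for _ in range(N_CLASSES)]  # FIFO ring buffers

def predict(x):
    """5-component likelihood vector via softmax."""
    z = W @ x + b
    e = np.exp(z - z.max())             # shift for numerical stability
    return e / e.sum()

def sgd_step(x, label, lr=0.1):
    """One stochastic gradient step on the cross-entropy loss."""
    global W, b
    grad = predict(x)
    grad[label] -= 1.0                  # dL/dz for softmax + cross-entropy
    W -= lr * np.outer(grad, x)
    b -= lr * grad

# Simulated supervisor stream: each class has its own mean feature vector.
rng = np.random.default_rng(0)
means = rng.normal(size=(N_CLASSES, N_FEATURES))
for _ in range(2000):
    label = int(rng.integers(N_CLASSES))
    x = means[label] + rng.normal(scale=0.5, size=N_FEATURES)
    buffers[label].append(x)            # newest sample evicts the oldest
    sgd_step(buffers[label][-1], label) # train on the newest buffered sample

correct = sum(predict(means[c]).argmax() == c for c in range(N_CLASSES))
```

After the simulated stream, the classifier recovers all 5 class means, and the bounded `deque` reproduces the FIFO eviction behavior of a ring buffer.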
<br />
= Experimental Results =<br />
<br />
== Performances of Feature Extractors ==<br />
<br />
[[Image:feature_extractors.png|frame| center | 400px | alt=|Comparison of Feature Extractors <ref name="hadsell2009" /> <span data-label="fig:featureextractors"></span>]]<br />
<br />
For testing the feature extractors, a dataset containing 160 hand-labeled frames from over 25 log files was used; the log files can be further divided into 7 groups, as seen in the figure above, which compares the 4 different feature extractors: Radial Basis Functions, a Convolutional Neural Network, an unsupervised auto-encoder and a supervised auto-encoder. In almost all cases the best feature extractor was the CNN trained with auto-encoders, which achieved the best average error rate of <math>8.46\%</math>.<br />
<br />
== Performances of Stereo Supervisor Module ==<br />
<br />
[[Image:stereo_module_comparison.png|frame| center | 400px | alt=|Stereo Module Performance <ref name="hadsell2009" /> <span data-label="fig:stereomodulecomparison"></span>]]<br />
<br />
To test the stereo module, it was compared against the online classifier using the same ground truth dataset as in the previous section. As seen in the figure above, the online classifier performs better than the stereo supervisor module; the authors note that this is due to the online classifier's ability to smooth and regularize the noisy data <ref name="hadsell2009" />.<br />
<br />
== Field Test ==<br />
<br />
The online classifier was deployed onto a Learning Applied to Ground Robots (LAGR) vehicle provided by the National Robotics Engineering Center (NREC) and tested on three different courses. The system contains 2 processes running simultaneously: the 1–2 Hz online classifier outlined above and a fast 8–10 Hz stereo-based obstacle avoidance module. The combination of the two provides good long-range and short-range obstacle avoidance capabilities.<br />
<br />
The system was found to be most effective when the long-range online classifier was combined with the short-range module; with the short-range module alone, which has a range of only around 5 meters, human intervention was often required to rescue the vehicle. No quantitative comparisons were given for these field tests; the evaluation is purely subjective and was only conducted during daytime.<br />
<br />
= Conclusion =<br />
<br />
This paper did not introduce novel ideas per se in terms of deep learning methods; however, the application of deep learning methods (CNN + auto-encoders) together with a stereo module to train a 5-label classifier was new in 2009 and shows great promise, increasing the road classification range from a maximum of 10–12 meters with purely stereo vision to over 100 meters <ref name="hadsell2009" />.<br />
<br />
There were several issues with the experiments I have observed:<br />
<br />
* <span>There was no mention of how many times the feature extractors were trained to obtain the best parameters, nor of the difficulty of training.</span><br />
* <span>All data and tests were collected during daytime, with no mention of limitations at night.</span><br />
* <span>The paper did not compare itself against other state-of-the-art systems such as <ref name="hong2002" /> <ref name="lieb2005" /> <ref name="dahlkamp2006" />; it was only compared against stereo-vision-based systems.</span><br />
* <span>The plot of stereo vision vs. the online classifier did not contain error bars. Also, on the x-axis the ground truth frames are ordered by error difference; it would be interesting to order them by time instead, to see whether stereo vision performs well at the beginning but poorly afterwards, which would support the authors' claim that an online classifier is able to smooth and regularize the noisy data.</span><br />
* <span>The field tests lacked quantitative measures to compare the long-range system against the short-range system.</span><br />
<br />
= References =<br />
<references /></div>
<hr />
<div>= Introduction =<br />
<br />
Stereo-vision has been used extensively for mobile robots in identifying near-to-far obstacles in its path, but is limited by it's max range of 12 meters. For the safety of high speed mobile robots recognizing obstacles at longer ranges is vital.<br />
<br />
The authors of this paper proposed a "long-range vision vision system that uses self-supervised learning to train a classifier in real-time" <ref name="hadsell2009">Hadsell, Raia, et al. "Learning long‐range vision for autonomous off‐road driving." Journal of Field Robotics 26.2 (2009): 120-144.</ref>; to robustly increase the obstacle and path detection range to over 100 meters. This approach has been implemented and tested on the Learning Applied to Ground Robots (LAGR) provided by the National Robotics Engineering Center (NREC).<br />
<br />
= Related Work =<br />
<br />
A common approach to vision-based driving is to process images captured from a pair of stereo cameras, produce a point cloud and use various heuristics to build a traversability map <ref name="goldberg2002">Goldberg, Steven B., Mark W. Maimone, and Lany Matthies. "Stereo vision and rover navigation software for planetary exploration." Aerospace Conference Proceedings, 2002. IEEE. Vol. 5. IEEE, 2002.</ref> <ref name="kriegman1989">Kriegman, David J., Ernst Triendl, and Thomas O. Binford. "Stereo vision and navigation in buildings for mobile robots." Robotics and Automation, IEEE Transactions on 5.6 (1989): 792-803.</ref> <ref name="kelly1998">Kelly, Alonzo, and Anthony Stentz. "Stereo vision enhancements for low-cost outdoor autonomous vehicles." Int’l Conf. on Robotics and Automation, Workshop WS-7, Navigation of Outdoor Autonomous Vehicles. Vol. 1. 1998.</ref><br />
, there has been efforts to increase the range of stereo vision by using the color of nearby ground and obstacles, but these color based improvements can easily be fooled by shadows, monochromatic terrain and complex obstacles or ground types.<br />
<br />
More recent vision based approaches such as <ref name="hong2002">Hong, Tsai Hong, et al. "Road detection and tracking for autonomous mobile robots." AeroSense 2002 (2002): 311-319.</ref> <ref name="lieb2005">Lieb, David, Andrew Lookingbill, and Sebastian Thrun. "Adaptive Road Following using Self-Supervised Learning and Reverse Optical Flow." Robotics: Science and Systems. 2005.</ref> <ref name="dahlkamp2006">Dahlkamp, Hendrik, et al. "Self-supervised Monocular Road Detection in Desert Terrain." Robotics: science and systems. 2006.</ref> use learning algorithms to map traversability information to color histograms or geometric (point cloud) data has achieved success in the DARPA challenge.<br />
<br />
Other, non-vision-based systems have used the near-to-far learning paradigm to classify distant sensor data based on self-supervision from a reliable, close-range sensor. A self-supervised classifier was trained on satellite imagery and ladar sensor data for the Spinner vehicle’s navigation system<ref><br />
Sofman, Boris, et al. "Improving robot navigation through self‐supervised online learning." Journal of Field Robotics 23.11‐12 (2006): 1059-1075.<br />
</ref><br />
and an online self-supervised classifier for a ladar-based navigation system was trained to predict load-bearing surfaces in the presence of vegetation.<ref><br />
Wellington, Carl, and Anthony Stentz. "Online adaptive rough-terrain navigation vegetation." Robotics and Automation, 2004. Proceedings. ICRA'04. 2004 IEEE International Conference on. Vol. 1. IEEE, 2004.<br />
</ref><br />
<br />
= Challenges =<br />
<br />
* <span>'''Choice of Feature Representation''': For example how to choose a robust feature representation that is informative enough that avoid irrelevant transformations</span><br />
* <span>'''Automatic generation of Training labels''': Because the classifier devised is trained in real-time, it requires a constant stream of training data and labels to learn from.</span><br />
* <span>'''Ability to generalize from near to far field''': Objects captured by the camera scales inversely proportional to the distance away from the camera, therefore the system needs to take this into account and normalize the objects detected.</span><br />
<br />
= Overview of the Learning Process =<br />
<br />
[[Image:method.png|frame| center | 400px | alt=|Learning System Proposed by <ref name="hadsell2009" /> ]]<br />
<br />
The learning process described by is as follows:<br />
<br />
# <span>'''Pre-Processing and Normalization''': This step involves correcting the skewed horizon captured by the camera and normalizing the scale of objects captured by the camera, since objects captured scales inversely proportional to distance away from camera.</span><br />
# <span>'''Feature Extraction''': Convolutional Neural Networks were trained and used to extract features in order to reduce dimensionality.</span><br />
# <span>'''Stereo Supervisor Module''': Complicated procedure that uses multiple ground plane estimation, heuristics and statistical false obstacle filtering to generate class labels to close range objects in the normalized input. The goal is to generate training data for the classifier at the end of this learning process.</span><br />
# <span>'''Training and Classification''': Once the class labels and feature extraction training data is combined, it is fed into the classifier for real-time training. The classifier is trained on every frame and the authors have used stochastic gradient descent to update the classifier weights and cross entropy as the loss function.</span><br />
<br />
== Pre-Processing and Normalization ==<br />
<br />
At the first stage of the learning process there are two issues that needs addressing, namely the skewed horizon due to the roll of camera and terrain, secondly the true scale of objects that appear in the input image, since objects scale inversely proportional to distance away from camera, the objects need to be normalized to represent its true scale.<br />
<br />
[[Image:horizon_pyramid.png|frame| center | 400px | alt=|Horizon Pyramid <ref name="hadsell2009" /> <span data-label="fig:hpyramid"></span>]]<br />
<br />
To solve both issues, a normalized “pyramid” containing 7 sub-images are extracted (see figure above), where the top row of the pyramid has a range from 112 meters to infinity and the closest pyramid row has a range of 4 to 11 meters. These pyramid sub images are extracted and normalized from the input image to form the input for the next stage.<br />
<br />
[[Image:horizon_normalize.png|frame| center | 400px | alt=|Creating target sub-image <ref name="hadsell2009" /> <span data-label="fig:hnorm"></span>]]<br />
<br />
To obtain the scaled and horizon corrected sub images the authors have used a combination of a Hough transform and PCA robust refit to estimate the ground plane <math>P = (p_{r}, p_{c}, p_{d}, p_{o})</math>. Where <math>p_{r}</math> is the roll, <math>p_{c}</math> is the column, <math>p_{d}</math> is the disparity and <math>p_{o}</math> is the offset. Once the ground plane <math>P</math> is estimated, the horizon target sub-image <math>A, B, C, D</math> (see figure above) is computed by calculating the plane <math>\overline{EF}</math> with stereo disparity of <math>d</math> pixels. The following equations were used to calculate the center of the line <math>M</math>, the plane <math>\overline{EF}</math>, rotation <math>\theta</math> and finally points <math>A, B, C, D</math>.<br />
<br />
<math>\textbf{M}_{y} = \frac{p_{c} \textbf{M}_{x} + p_{d} d + p-{o}}{-p_{r}}</math><br />
<br />
<math>E = (<br />
\textbf{M}_{x} - \textbf{M}_{x} \cos{\theta},<br />
\textbf{M}_{y} - \textbf{M}_{y} \sin{\theta},<br />
)</math><br />
<br />
<math>F = (<br />
\textbf{M}_{x} + \textbf{M}_{x} \cos{\theta},<br />
\textbf{M}_{y} + \textbf{M}_{y} \sin{\theta},<br />
)</math><br />
<br />
<math>\theta = \left( <br />
\frac{\textbf{w}_{pc} + p_{d} + p_{o}}{-p_{r}}<br />
- \frac{p_{d} + p_{o}}{-p_{r}} / w<br />
\right)</math><br />
<br />
<math>A = (<br />
\textbf{E}_{x} + \alpha \sin \theta,<br />
\textbf{E}_{y} - \alpha \cos \theta,<br />
)</math><br />
<br />
<math>B = (<br />
\textbf{F}_{x} + \alpha \sin \theta,<br />
\textbf{F}_{y} - \alpha \cos \theta,<br />
)</math><br />
<br />
<math>C = (<br />
\textbf{F}_{x} - \alpha \sin \theta,<br />
\textbf{F}_{y} + \alpha \cos \theta,<br />
)</math><br />
<br />
<math>D = (<br />
\textbf{E}_{x} - \alpha \sin \theta,<br />
\textbf{E}_{y} + \alpha \cos \theta,invariance<br />
)</math><br />
<br />
The last step of this stage is that the images were converted from RGB to YUV, common in image processing pipelines.<br />
<br />
== Feature Extraction ==<br />
<br />
The goal of the feature extraction is to reduce the input dimensionality and increase the generality of the resulting classifier to be trained. Instead of using hand-tuned feature list, <ref name="hadsell2009" /> used a data driven approach and trained 4 different feature extractors, this is the only component of the learning process where it is trained off-line.<br />
<br />
<ul><br />
<li><p>'''Radial Basis Functions (RBF)''': A set of RBF were learned to form a feature vector by calculating the Euclidean distance between input window and each of the 100 RBF centers. Where each feature vector <math>D</math> has the form:</p><br />
<p><math>D_{j} = exp(-\beta^{i} || X - K^{i} ||^{2}_{2})</math></p><br />
<p>Where <math>\beta^{i}</math> is the inverse variance of the RBF center <math>K^{i}</math>, <math>X</math> is the input window, <math>K</math> is the set of <math>n</math> radial basis centers <math>K = \{K^{i} | i = 1 \dots n\}</math>.</p></li><br />
<li><p>'''Convolution Neural Network (CNN)''': A standard CNN was used, the architecture consisted of two layers, the first has 20 7x6 filters and the second has 369 6x5 filters. During training a 100 fully connected hidden neuron layer is added as a last layer to train with 5 outputs. Once the network is trained however that last layer was removed, and thus the resulting CNN outputs a 100 component feature vector. For training the authors random initialized the weights, used stochastic graident decent for 30 epochs, and <math>L^2</math> regularization. The network was trained against 450,000 labeled image patches, and tested against 50,000 labeled patches.</p></li><br />
<li><p>'''Supervised and Unsupervised Auto-Encoders''': Auto-Encoders or Deep Belief Networks <ref name="hinton2006">Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief nets." Neural computation 18.7 (2006): 1527-1554.</ref> <ref name="ranzato2007">Ranzato, Marc Aurelio, et al. "Unsupervised learning of invariant feature hierarchies with applications to object recognition." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007.</ref> is a layer wise training procedure. The deep belief net trained has 3 layers, where the first and third are convolutional layers, and the second one is a maxpool layer, the architecture is explained in figure below.</p><br />
[[Image:convo_arch.png|frame| center | 400px | alt=|Convolution Neural Network <ref name="hadsell2009" /> <span data-label="fig:convoarch"></span>]]<br />
<br />
<p>For training, the loss function is mean squared loss. At first the network is trained with 10,000 unlabeled images (unsupervised training) with varying outdoor settings (150 settings), then the network is fined tuned with labeled dataset (supervised training), the authors did not mention how large the labeled dataset was, and what training parameters were used for the supervised stage.</p></li></ul><br />
<br />
== Stereo Supervisor Module ==<br />
<br />
[[Image:ground_plane_estimation.png|frame| center | 400px | alt=|Ground Plane Estimation <ref name="hadsell2009" /> <span data-label="fig:gplanes"></span>]]<br />
<br />
Once the images have been preprocessed and normalized, stereo vision algorithms are used to produce data samples and labels that are “visually consistent, error free and well distributed”. There are 4 steps at this stage:<br />
<br />
<ol><br />
<li><p><span>'''3D point cloud''': Step one, a 3D point cloud is produced by using the Triclops stereo vision algorithm from Point Grey Research. The algorithm used has a range of 12 to 15 meters and works by triangulating objects between two images to find the depth.</span></p></li><br />
<li><p>'''Estimation of ground plane''': Secondly a ground plane model is found by using a combination of Hough transform and principle component analysis (PCA) to fit a plane onto the point cloud <math>S = \{ (x^{i}, y^{i}, z^{i}) | i = 1 \dots n) \} </math>. Where <math>x^{i}, y^{i}, z^{i}</math> defines the position of point relative to the robot’s center, and <math>n</math> is the number of points in the point cloud.</p><br />
<p>The rational behind using Hough transform is since multiple ground planes can be found (see figure above), a voting system was introduced where by the parameter vector which denotes the ground plane parameter (such as pitch, roll and offset) and has the most votes is used. It is selected by the following equation:</p><br />
<p><math>X = P_{ijk} | i, j, k = argmax_{i,j,k} (V_{ijk})</math></p><br />
<p>Where <math>X</math> is the new plane estimate, <math>V</math> is a tensor that accumulates the votes and <math>P</math> is a tensor that records the plane parameter space. Then PCA is used to refit and compute the eigenvalue decomposition of the covariance matrix of the points <math>X^{1 \dots n}</math>.</p><br />
<p><math>\frac{1}{n} \sum^{n}_{1} X^{i} X^{i'} = Q \Lambda Q</math></p><br />
<p>It should be noted, however, that multiple ground planes does not eliminate all errors from the labeling process. The authors of this paper used the following heuristics to minimize the errors in the training data. The heuristic is and I quote:<br />
<br />
<blockquote><br />
{{Quote|text=}} <ref name="hadsell2009" /><br />
</blockquote><br />
<br />
<li><span>'''Projection''': Stereo vision has the limitation of only being able to robustly detect short range (12m max) objects. In an attempt to mitigate the uncertainty of long range objects, footlines of obstacles (the bottom outline of the obstacle) are used. This gives stereo vision better estimates about the scale and distance of long range objects. The footline of long range objects are found by projecting obstacle points onto the ground planes and marking high point-density regions.</span></p></li><br />
<li><p>'''Labeling''': Once the ground plane estimation, footline projections and obstacle points are found, ground map <math>G</math>, footline-map <math>F</math> and obstacle-map <math>O</math> can be produced.</p><br />
<p>Conventionally binary classifiers are used for terrain traversability, however, used a classifier that uses 5 labels:</p><br />
<ul><br />
<li><p><span>Super-traversable</span></p></li><br />
<li><p><span>Ground</span></p></li><br />
<li><p><span>Footline</span></p></li><br />
<li><p><span>Obstacle</span></p></li><br />
<li><p><span>Super-obstacle</span></p></li></ul><br />
<br />
[[Image:label_categories.png|frame| center | 400px | alt=|Label Categories <ref name="hadsell2009" /> <span data-label="fig:labelcategories"></span>]]<br />
<br />
<p>Where super-traversable and super-obstacle are high confidence labels that refer to input windows where only ground or obstacles are seen. Lower confidence labels such as ground and obstacle are used when there are mixture of points in the input window. Lastly footline labels are assigned when footline points are centered in the middle of the input window. The label criteria rules used by <ref name="hadsell2009" /> are outlined in figure below</p><br />
[[Image:label_criteria.png|frame| center | 400px | alt=|Label Criteria Rules <ref name="hadsell2009" /> <span data-label="fig:labelcriteria"></span>]]<br />
</li></ol><br />
<br />
== Training and Classification ==<br />
<br />
The real-time classifier is the last stage of the learning process. Due to its real-time nature the classifier has to be simple and efficient, therefore 5 logistic regression classifiers (one for each category) with a Kullback-Liebler divergence or relative entropy loss function and stochastic gradient descent was used. Additionally 5 ring buffer or circular buffer are used to store incoming data from the feature extraction and stereo supervisor. The ring buffer acts as a First In First Out (FIFO) queue and stores temporary data as it is being received and processed. The result is that the classifiers outputs a 5 component likelihood vectors for each input.<br />
<br />
= Experimental Results =<br />
<br />
== Performances of Feature Extractors ==<br />
<br />
[[Image:feature_extractors.png|frame| center | 400px | alt=|Comparision of Feature Extractors <ref name="hadsell2009" /> <span data-label="fig:featureextractors"></span>]]<br />
<br />
For testing the feature extractors, a dataset containing 160 hand labeled frames from over 25 log files were used, the log files can be further divided into 7 groups as seen in figure above, where it is a comparision of the 4 different feature extractors: Radial Basis Functions, Convolutional Neural Network, an Unsupervised Auto-Encoder and finally a supervised Auto-Encoder. In almost all cases it can be observed that the best feature extractor was the CNN trained with Auto-Encoders with the best average error rate of <math>8.46\%</math>.<br />
<br />
== Performances of Stereo Supervisor Module ==<br />
<br />
[[Image:stereo_module_comparison.png|frame| center | 400px | alt=|Stereo Module Performance <ref name="hadsell2009" /> <span data-label="fig:stereomodulecomparison"></span>]]<br />
<br />
To test the stereo module it was compared against the online classifier using the same ground truth dataset used in the previous section. As you can see from figure above the online classifier performs better than the stereo supervisor module, the authors note that it is due to the online classifier ability to smooth and regularize the noisy data <ref name="hadsell2009" />.<br />
<br />
== Field Test ==<br />
<br />
The online classifier was deployed onto a Learning Applied to Ground Robots (LAGR) vehicle provided by the National Robotics Engineering Center (NREC), and tested on three different courses. The system contains 2 processes running simultaneously, a 1-2 Hz online classifier outlined above, and a fast 8 - 10 Hz stereo based obstacle avoidance module. The combination of the both provides good long range and short range obstacle capabilities.<br />
<br />
The system was found to be most effective when long-range online classifier was combined with the short range module, as the short range only has a range of around 5 meters it often required human intervention to rescue the vehicle. No quantitative comparisons were given for these field tests, it is purely subjective and only tested during daytime.<br />
<br />
= Conclusion =<br />
<br />
This paper did not introduce novel ideas per se in terms of deep learning methods, however the application of deep learning methods (CNN + auto-encoders) along with stereo module to train a 5 label classifier shows great promise in increasing the road classification from a max range of 10 - 12 meters with purely stereo vision to over 100 meters is new in 2009 <ref name="hadsell2009" />.<br />
<br />
There were several issues with the experiments I have observed:<br />
<br />
* <span>There were no mention how many times the feature extractors were trained to obtain best parameters, nor the difficulty in training.</span><br />
* <span>All data and tests were performed during daytime, no mention of limitations at night.</span><br />
* <span>This paper did not compare itself against other state of the art systems such as <ref name="hong2002" /> <ref name="lieb2005" /> <ref name="dahlkamp2006" /> other than stereo vision based systems.</span><br />
* <span>In the plot of stereo vision vs online classifier did not contain error bars. Also on the x-axis the groundtruth frames are ordered by error difference, it would be interesting to see what would happen if it was time ordered instead, and whether it would tell us that stereo vision performs well at the beginning but poorly afterwards, supporting the authors claim that an online classifier is able to smooth and regularize the noisy data.</span><br />
* <span>Field tests lacked a quantitative measures to compare between the long range system against the short range system.</span><br />
<br />
= References =<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Long-Range_Vision_for_Autonomous_Off-Road_Driving&diff=27354learning Long-Range Vision for Autonomous Off-Road Driving2015-12-18T04:16:03Z<p>Rqiao: /* Related Work */</p>
<hr />
<div>= Introduction =<br />
<br />
Stereo-vision has been used extensively for mobile robots in identifying near-to-far obstacles in its path, but is limited by it's max range of 12 meters. For the safety of high speed mobile robots recognizing obstacles at longer ranges is vital.<br />
<br />
The authors of this paper proposed a "long-range vision vision system that uses self-supervised learning to train a classifier in real-time" <ref name="hadsell2009">Hadsell, Raia, et al. "Learning long‐range vision for autonomous off‐road driving." Journal of Field Robotics 26.2 (2009): 120-144.</ref>; to robustly increase the obstacle and path detection range to over 100 meters. This approach has been implemented and tested on the Learning Applied to Ground Robots (LAGR) provided by the National Robotics Engineering Center (NREC).<br />
<br />
= Related Work =<br />
<br />
A common approach to vision-based driving is to process images captured from a pair of stereo cameras, produce a point cloud and use various heuristics to build a traversability map <ref name="goldberg2002">Goldberg, Steven B., Mark W. Maimone, and Lany Matthies. "Stereo vision and rover navigation software for planetary exploration." Aerospace Conference Proceedings, 2002. IEEE. Vol. 5. IEEE, 2002.</ref> <ref name="kriegman1989">Kriegman, David J., Ernst Triendl, and Thomas O. Binford. "Stereo vision and navigation in buildings for mobile robots." Robotics and Automation, IEEE Transactions on 5.6 (1989): 792-803.</ref> <ref name="kelly1998">Kelly, Alonzo, and Anthony Stentz. "Stereo vision enhancements for low-cost outdoor autonomous vehicles." Int’l Conf. on Robotics and Automation, Workshop WS-7, Navigation of Outdoor Autonomous Vehicles. Vol. 1. 1998.</ref><br />
There have also been efforts to increase the range of stereo vision by using the color of nearby ground and obstacles, but these color-based improvements can easily be fooled by shadows, monochromatic terrain, and complex obstacle or ground types.<br />
<br />
More recent vision-based approaches such as <ref name="hong2002">Hong, Tsai Hong, et al. "Road detection and tracking for autonomous mobile robots." AeroSense 2002 (2002): 311-319.</ref> <ref name="lieb2005">Lieb, David, Andrew Lookingbill, and Sebastian Thrun. "Adaptive Road Following using Self-Supervised Learning and Reverse Optical Flow." Robotics: Science and Systems. 2005.</ref> <ref name="dahlkamp2006">Dahlkamp, Hendrik, et al. "Self-supervised Monocular Road Detection in Desert Terrain." Robotics: science and systems. 2006.</ref> use learning algorithms to map traversability information to color histograms or geometric (point cloud) data, and have achieved success in the DARPA challenge.<br />
<br />
Other, non-vision-based systems have used the near-to-far learning paradigm to classify distant sensor data based on self-supervision from a reliable, close-range sensor. A self-supervised classifier was trained on satellite imagery and ladar sensor data for the Spinner vehicle’s navigation system<ref><br />
Sofman, Boris, et al. "Improving robot navigation through self‐supervised online learning." Journal of Field Robotics 23.11‐12 (2006): 1059-1075.<br />
</ref><br />
and an online self-supervised classifier for a ladar-based navigation system was trained to predict load-bearing surfaces in the presence of vegetation<ref><br />
Wellington, Carl, and Anthony Stentz. "Online adaptive rough-terrain navigation vegetation." Robotics and Automation, 2004. Proceedings. ICRA'04. 2004 IEEE International Conference on. Vol. 1. IEEE, 2004.<br />
</ref><br />
<br />
= Challenges =<br />
<br />
* <span>'''Choice of Feature Representation''': The feature representation must be robust and informative enough to discriminate terrain classes, while remaining invariant to irrelevant transformations.</span><br />
* <span>'''Automatic Generation of Training Labels''': Because the classifier is trained in real time, it requires a constant stream of training data and labels to learn from.</span><br />
* <span>'''Ability to Generalize from Near to Far Field''': The apparent size of objects captured by the camera scales inversely with their distance from the camera, so the system needs to account for this and normalize the detected objects.</span><br />
<br />
= Overview of the Learning Process =<br />
<br />
[[Image:method.png|frame| center | 400px | alt=|Learning System Proposed by <ref name="hadsell2009" /> ]]<br />
<br />
The learning process described by <ref name="hadsell2009" /> is as follows:<br />
<br />
# <span>'''Pre-Processing and Normalization''': This step corrects the skewed horizon captured by the camera and normalizes the scale of captured objects, since their apparent size scales inversely with distance from the camera.</span><br />
# <span>'''Feature Extraction''': Convolutional neural networks were trained and used to extract features in order to reduce dimensionality.</span><br />
# <span>'''Stereo Supervisor Module''': A procedure that uses multiple ground-plane estimation, heuristics, and statistical false-obstacle filtering to generate class labels for close-range objects in the normalized input. Its goal is to generate training data for the classifier at the end of the learning process.</span><br />
# <span>'''Training and Classification''': Once the class labels and the extracted features are combined, they are fed into the classifier for real-time training. The classifier is trained on every frame; the authors used stochastic gradient descent to update the classifier weights, with cross-entropy as the loss function.</span><br />
<br />
== Pre-Processing and Normalization ==<br />
<br />
At the first stage of the learning process there are two issues that need addressing: first, the skewed horizon due to the roll of the camera and terrain; second, the true scale of objects that appear in the input image. Since objects scale inversely with distance from the camera, they need to be normalized to represent their true scale.<br />
<br />
[[Image:horizon_pyramid.png|frame| center | 400px | alt=|Horizon Pyramid <ref name="hadsell2009" /> <span data-label="fig:hpyramid"></span>]]<br />
<br />
To solve both issues, a normalized “pyramid” containing 7 sub-images is extracted (see figure above), where the top row of the pyramid covers the range from 112 meters to infinity and the closest pyramid row covers 4 to 11 meters. These pyramid sub-images are extracted and normalized from the input image to form the input for the next stage.<br />
<br />
[[Image:horizon_normalize.png|frame| center | 400px | alt=|Creating target sub-image <ref name="hadsell2009" /> <span data-label="fig:hnorm"></span>]]<br />
<br />
To obtain the scaled, horizon-corrected sub-images, the authors used a combination of a Hough transform and a robust PCA refit to estimate the ground plane <math>P = (p_{r}, p_{c}, p_{d}, p_{o})</math>, where <math>p_{r}</math> is the roll, <math>p_{c}</math> the column, <math>p_{d}</math> the disparity, and <math>p_{o}</math> the offset. Once the ground plane <math>P</math> is estimated, the target sub-image <math>A, B, C, D</math> (see figure above) is computed from the line <math>\overline{EF}</math> at a stereo disparity of <math>d</math> pixels. The following equations calculate the center <math>M</math> of the line <math>\overline{EF}</math>, its endpoints <math>E</math> and <math>F</math>, the rotation <math>\theta</math>, and finally the points <math>A, B, C, D</math>.<br />
<br />
<math>\textbf{M}_{y} = \frac{p_{c} \textbf{M}_{x} + p_{d} d + p_{o}}{-p_{r}}</math><br />
<br />
<math>E = \left( \textbf{M}_{x} - \textbf{M}_{x} \cos{\theta},\ \textbf{M}_{y} - \textbf{M}_{x} \sin{\theta} \right)</math><br />
<br />
<math>F = \left( \textbf{M}_{x} + \textbf{M}_{x} \cos{\theta},\ \textbf{M}_{y} + \textbf{M}_{x} \sin{\theta} \right)</math><br />
<br />
<math>\theta = \left( \frac{p_{c} w + p_{d} d + p_{o}}{-p_{r}} - \frac{p_{d} d + p_{o}}{-p_{r}} \right) / w</math><br />
<br />
<math>A = \left( \textbf{E}_{x} + \alpha \sin \theta,\ \textbf{E}_{y} - \alpha \cos \theta \right)</math><br />
<br />
<math>B = \left( \textbf{F}_{x} + \alpha \sin \theta,\ \textbf{F}_{y} - \alpha \cos \theta \right)</math><br />
<br />
<math>C = \left( \textbf{F}_{x} - \alpha \sin \theta,\ \textbf{F}_{y} + \alpha \cos \theta \right)</math><br />
<br />
<math>D = \left( \textbf{E}_{x} - \alpha \sin \theta,\ \textbf{E}_{y} + \alpha \cos \theta \right)</math><br />
<br />
As the last step of this stage, the images are converted from RGB to YUV, a color space commonly used in image-processing pipelines.<br />
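The sub-image geometry above can be sketched in NumPy. This is an illustrative reading of the equations, not the paper's implementation: the function name and arguments are ours, we assume the line center has <math>\textbf{M}_{x} = w/2</math> (the horizontal image center), that <math>E</math> and <math>F</math> lie a half-width from <math>M</math> along direction <math>\theta</math>, and that <math>\theta</math> is the small-angle slope of the constant-disparity line.<br />

```python
import numpy as np

def horizon_subimage_corners(p_r, p_c, p_d, p_o, d, w, alpha):
    """Illustrative sketch of the target sub-image corner computation.

    Given estimated ground-plane parameters P = (p_r, p_c, p_d, p_o),
    a stereo disparity d, image width w, and half-height alpha, return
    the corners A, B, C, D of the horizon-levelled sub-image.
    """
    # Center of the line EF: x at the middle of the image, y from the plane.
    M_x = w / 2.0
    M_y = (p_c * M_x + p_d * d + p_o) / (-p_r)
    # Small-angle slope of the constant-disparity line across the image.
    theta = ((p_c * w + p_d * d + p_o) / (-p_r) - (p_d * d + p_o) / (-p_r)) / w
    # Endpoints E, F of the line, half a width to either side of M.
    E = (M_x - M_x * np.cos(theta), M_y - M_x * np.sin(theta))
    F = (M_x + M_x * np.cos(theta), M_y + M_x * np.sin(theta))
    # Offset the endpoints perpendicular to EF by alpha to get the corners.
    A = (E[0] + alpha * np.sin(theta), E[1] - alpha * np.cos(theta))
    B = (F[0] + alpha * np.sin(theta), F[1] - alpha * np.cos(theta))
    C = (F[0] - alpha * np.sin(theta), F[1] + alpha * np.cos(theta))
    D = (E[0] - alpha * np.sin(theta), E[1] + alpha * np.cos(theta))
    return A, B, C, D
```

For a level ground plane (zero column term) the sketch degenerates to a horizontal strip of height <math>2\alpha</math> centered on the horizon line, as expected.<br />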
<br />
== Feature Extraction ==<br />
<br />
The goal of feature extraction is to reduce the input dimensionality and increase the generality of the resulting classifier. Instead of using a hand-tuned feature list, <ref name="hadsell2009" /> used a data-driven approach and trained 4 different feature extractors. This is the only component of the learning process that is trained off-line.<br />
<br />
<ul><br />
<li><p>'''Radial Basis Functions (RBF)''': A set of RBFs was learned; a feature vector is formed from the Euclidean distances between the input window and each of the 100 RBF centers. Each component <math>D_{i}</math> of the feature vector has the form:</p><br />
<p><math>D_{i} = \exp(-\beta^{i} || X - K^{i} ||^{2}_{2})</math></p><br />
<p>where <math>\beta^{i}</math> is the inverse variance of the RBF center <math>K^{i}</math>, <math>X</math> is the input window, and <math>K = \{K^{i} | i = 1 \dots n\}</math> is the set of <math>n</math> radial basis centers.</p></li><br />
<li><p>'''Convolutional Neural Network (CNN)''': A standard CNN was used. The architecture consists of two layers: the first has 20 7x6 filters and the second has 369 6x5 filters. During training, a fully connected hidden layer of 100 neurons is added as a last layer to train with 5 outputs. Once the network is trained, that last layer is removed, so the resulting CNN outputs a 100-component feature vector. For training, the authors randomly initialized the weights and used stochastic gradient descent for 30 epochs with <math>L^2</math> regularization. The network was trained on 450,000 labeled image patches and tested on 50,000 labeled patches.</p></li><br />
<li><p>'''Supervised and Unsupervised Auto-Encoders''': Auto-encoders and Deep Belief Networks <ref name="hinton2006">Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief nets." Neural computation 18.7 (2006): 1527-1554.</ref> <ref name="ranzato2007">Ranzato, Marc Aurelio, et al. "Unsupervised learning of invariant feature hierarchies with applications to object recognition." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007.</ref> use a layer-wise training procedure. The deep belief net trained here has 3 layers: the first and third are convolutional layers, and the second is a max-pooling layer. The architecture is shown in the figure below.</p><br />
[[Image:convo_arch.png|frame| center | 400px | alt=|Convolution Neural Network <ref name="hadsell2009" /> <span data-label="fig:convoarch"></span>]]<br />
<br />
<p>For training, the loss function is mean squared error. First, the network is trained on 10,000 unlabeled images (unsupervised training) spanning 150 varying outdoor settings; the network is then fine-tuned on a labeled dataset (supervised training). The authors did not mention how large the labeled dataset was or what training parameters were used for the supervised stage.</p></li></ul><br />
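The RBF feature computation used by the first extractor can be sketched in a few lines of NumPy. This is a minimal illustration of the formula; the function and argument names are ours, not the authors'.<br />

```python
import numpy as np

def rbf_features(X, K, beta):
    """Compute the RBF feature vector D for one input window.

    X    : flattened input window, shape (p,)
    K    : radial basis centers, shape (n, p)
    beta : inverse variances of the centers, shape (n,)
    Each component is D_i = exp(-beta_i * ||X - K_i||_2^2).
    """
    sq_dists = np.sum((K - X) ** 2, axis=1)  # squared Euclidean distances
    return np.exp(-beta * sq_dists)
```

A window identical to a center yields a feature value of 1 for that center, decaying toward 0 as the distance grows.<br />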
<br />
== Stereo Supervisor Module ==<br />
<br />
[[Image:ground_plane_estimation.png|frame| center | 400px | alt=|Ground Plane Estimation <ref name="hadsell2009" /> <span data-label="fig:gplanes"></span>]]<br />
<br />
Once the images have been preprocessed and normalized, stereo vision algorithms are used to produce data samples and labels that are “visually consistent, error free and well distributed”. There are 4 steps at this stage:<br />
<br />
<ol><br />
<li><p><span>'''3D point cloud''': First, a 3D point cloud is produced using the Triclops stereo vision algorithm from Point Grey Research. The algorithm has a range of 12 to 15 meters and works by triangulating corresponding points between the two camera images to find depth.</span></p></li><br />
<li><p>'''Estimation of ground plane''': Second, a ground plane model is found using a combination of the Hough transform and principal component analysis (PCA) to fit a plane onto the point cloud <math>S = \{ (x^{i}, y^{i}, z^{i}) | i = 1 \dots n \} </math>, where <math>(x^{i}, y^{i}, z^{i})</math> is the position of a point relative to the robot’s center and <math>n</math> is the number of points in the point cloud.</p><br />
<p>The rationale behind using the Hough transform is that multiple candidate ground planes can be found (see figure above), so a voting system is used whereby the parameter vector describing the ground plane (pitch, roll, and offset) that receives the most votes is selected, according to the following equation:</p><br />
<p><math>X = P_{ijk}, \quad (i, j, k) = \operatorname{argmax}_{i,j,k} (V_{ijk})</math></p><br />
<p>where <math>X</math> is the new plane estimate, <math>V</math> is a tensor that accumulates the votes, and <math>P</math> is a tensor that records the plane parameter space. PCA is then used to refit the plane by computing the eigenvalue decomposition of the covariance matrix of the points <math>X^{1 \dots n}</math>:</p><br />
<p><math>\frac{1}{n} \sum^{n}_{i=1} X^{i} (X^{i})^{\top} = Q \Lambda Q^{\top}</math></p><br />
<p>It should be noted, however, that using multiple ground planes does not eliminate all errors from the labeling process. The authors of this paper used the following heuristic to minimize the errors in the training data, quoted here:<br />
<br />
<blockquote><br />
{{Quote|text=}} <ref name="hadsell2009" /><br />
</blockquote><br />
<br />
<li><span>'''Projection''': Stereo vision can only robustly detect objects at short range (12 m max). To mitigate the uncertainty about long-range objects, footlines of obstacles (the bottom outline of the obstacle) are used; these give better estimates of the scale and distance of long-range objects. The footlines of long-range objects are found by projecting obstacle points onto the ground planes and marking regions of high point density.</span></p></li><br />
<li><p>'''Labeling''': Once the ground plane estimation, footline projections and obstacle points are found, ground map <math>G</math>, footline-map <math>F</math> and obstacle-map <math>O</math> can be produced.</p><br />
<p>Conventionally, binary classifiers are used for terrain traversability; however, <ref name="hadsell2009" /> used a classifier with 5 labels:</p><br />
<ul><br />
<li><p><span>Super-traversable</span></p></li><br />
<li><p><span>Ground</span></p></li><br />
<li><p><span>Footline</span></p></li><br />
<li><p><span>Obstacle</span></p></li><br />
<li><p><span>Super-obstacle</span></p></li></ul><br />
<br />
[[Image:label_categories.png|frame| center | 400px | alt=|Label Categories <ref name="hadsell2009" /> <span data-label="fig:labelcategories"></span>]]<br />
<br />
<p>Here, super-traversable and super-obstacle are high-confidence labels that refer to input windows in which only ground or only obstacles are seen. The lower-confidence labels ground and obstacle are used when there is a mixture of points in the input window. Finally, footline labels are assigned when footline points are centered in the middle of the input window. The label criteria rules used by <ref name="hadsell2009" /> are outlined in the figure below.</p><br />
[[Image:label_criteria.png|frame| center | 400px | alt=|Label Criteria Rules <ref name="hadsell2009" /> <span data-label="fig:labelcriteria"></span>]]<br />
</li></ol><br />
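The PCA refit in the ground-plane estimation step can be sketched as a generic plane fit by eigen-decomposition of the point covariance. This stands in for the paper's Hough-plus-PCA procedure; the Hough voting step is assumed to have already selected the candidate ground points, and all names are ours.<br />

```python
import numpy as np

def refit_ground_plane(points):
    """Refit a plane to a 3-D point cloud via PCA (sketch of the refit step).

    points : (n, 3) array of points already selected as ground candidates,
             e.g. by a Hough-transform vote over plane parameters.
    Returns (normal, centroid); the plane normal is the eigenvector of the
    covariance matrix with the smallest eigenvalue.
    """
    centroid = points.mean(axis=0)
    centered = points - centroid
    cov = centered.T @ centered / len(points)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    normal = eigvecs[:, 0]                  # direction of least variance
    return normal, centroid
```

For points lying exactly on a plane, the smallest eigenvalue is zero and the recovered normal is perpendicular to that plane.<br />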
<br />
== Training and Classification ==<br />
<br />
The real-time classifier is the last stage of the learning process. Due to its real-time nature the classifier has to be simple and efficient; therefore 5 logistic regression classifiers (one per category) are used with a Kullback-Leibler divergence (relative entropy) loss function and stochastic gradient descent. Additionally, 5 ring (circular) buffers are used to store incoming data from the feature extractor and the stereo supervisor. A ring buffer acts as a First In First Out (FIFO) queue, storing temporary data as it is received and processed. The classifier outputs a 5-component likelihood vector for each input.<br />
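A minimal sketch of such an online trainer, assuming per-class ring buffers and independent logistic units trained with a cross-entropy gradient; all names, buffer sizes, and the learning rate are illustrative, not the paper's.<br />

```python
import numpy as np
from collections import deque

class OnlineClassifier:
    """Sketch: one logistic regression per class, trained by SGD on samples
    held in fixed-size ring buffers (FIFO queues)."""

    def __init__(self, n_features, n_classes=5, lr=0.01, buffer_size=1000):
        self.W = np.zeros((n_classes, n_features))
        self.b = np.zeros(n_classes)
        self.lr = lr
        # One ring buffer per class label; old samples fall off the end.
        self.buffers = [deque(maxlen=buffer_size) for _ in range(n_classes)]

    def add_sample(self, features, label):
        self.buffers[label].append(features)

    def predict(self, features):
        """Likelihood vector: one independent sigmoid per class."""
        return 1.0 / (1.0 + np.exp(-(self.W @ features + self.b)))

    def train_step(self):
        """One SGD pass over the buffered samples (cross-entropy gradient)."""
        for label, buf in enumerate(self.buffers):
            for x in buf:
                p = self.predict(x)
                target = np.zeros(len(self.b))
                target[label] = 1.0
                grad = p - target          # d(cross-entropy)/d(logits)
                self.W -= self.lr * np.outer(grad, x)
                self.b -= self.lr * grad
```

Because the buffers are FIFO, the classifier continually adapts to the most recent terrain while older samples age out, which matches the real-time, frame-by-frame training described above.<br />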
<br />
= Experimental Results =<br />
<br />
== Performances of Feature Extractors ==<br />
<br />
[[Image:feature_extractors.png|frame| center | 400px | alt=|Comparison of Feature Extractors <ref name="hadsell2009" /> <span data-label="fig:featureextractors"></span>]]<br />
<br />
For testing the feature extractors, a dataset containing 160 hand-labeled frames from over 25 log files was used. The log files can be divided into 7 groups, as seen in the figure above, which compares the 4 feature extractors: Radial Basis Functions, a Convolutional Neural Network, an unsupervised auto-encoder, and a supervised auto-encoder. In almost all cases the best feature extractor was the CNN trained with auto-encoders, with the best average error rate of <math>8.46\%</math>.<br />
<br />
== Performances of Stereo Supervisor Module ==<br />
<br />
[[Image:stereo_module_comparison.png|frame| center | 400px | alt=|Stereo Module Performance <ref name="hadsell2009" /> <span data-label="fig:stereomodulecomparison"></span>]]<br />
<br />
To test the stereo module, it was compared against the online classifier using the same ground-truth dataset as in the previous section. As the figure above shows, the online classifier performs better than the stereo supervisor module; the authors attribute this to the online classifier's ability to smooth and regularize the noisy data <ref name="hadsell2009" />.<br />
<br />
== Field Test ==<br />
<br />
The online classifier was deployed on a Learning Applied to Ground Robots (LAGR) vehicle provided by the National Robotics Engineering Center (NREC) and tested on three different courses. The system runs 2 processes simultaneously: the 1-2 Hz online classifier outlined above, and a fast 8-10 Hz stereo-based obstacle avoidance module. The combination of the two provides good long-range and short-range obstacle avoidance capabilities.<br />
<br />
The system was found to be most effective when the long-range online classifier was combined with the short-range module; since the short-range module only covers around 5 meters, using it alone often required human intervention to rescue the vehicle. No quantitative comparisons were given for these field tests; the evaluation is purely subjective and was only conducted during daytime.<br />
<br />
= Conclusion =<br />
<br />
This paper did not introduce novel deep learning methods per se; however, the application of deep learning methods (CNN + auto-encoders) along with a stereo module to train a 5-label classifier shows great promise, increasing the road classification range from a maximum of 10-12 meters with purely stereo vision to over 100 meters, which was new in 2009 <ref name="hadsell2009" />.<br />
<br />
I observed several issues with the experiments:<br />
<br />
* <span>There was no mention of how many times the feature extractors were trained to obtain the best parameters, nor of the difficulty of training.</span><br />
* <span>All data and tests were performed during daytime, no mention of limitations at night.</span><br />
* <span>This paper did not compare itself against other state of the art systems such as <ref name="hong2002" /> <ref name="lieb2005" /> <ref name="dahlkamp2006" /> other than stereo vision based systems.</span><br />
* <span>The plot of stereo vision vs. the online classifier did not contain error bars. Also, on the x-axis the ground-truth frames are ordered by error difference; it would be interesting to see what would happen if they were ordered by time instead, and whether stereo vision performs well at the beginning but poorly afterwards, which would support the authors' claim that an online classifier is able to smooth and regularize the noisy data.</span><br />
* <span>The field tests lacked quantitative measures to compare the long-range system against the short-range system.</span><br />
<br />
= References =<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Sparse_Rectifier_Neural_Networks&diff=27353deep Sparse Rectifier Neural Networks2015-12-18T03:11:52Z<p>Rqiao: /* Biological Plausibility and Sparsity */</p>
<hr />
<div>= Introduction =<br />
<br />
Machine learning scientists and computational neuroscientists approach neural networks differently. Machine learning scientists aim for models that are easy to train and generalize well, while neuroscientists aim to produce useful representations of scientific data. In other words, machine learning scientists care more about efficiency, while neuroscientists care more about the interpretability of the model.<br />
<br />
In this paper the authors show that two common gaps between computational neuroscience models and machine learning neural network models can be bridged by the rectifier activation function: one is the gap between deep networks learned with and without unsupervised pre-training; the other is the gap between the activation function and sparsity in neural networks.<br />
<br />
== Biological Plausibility and Sparsity ==<br />
<br />
In the brain, neurons rarely fire at the same time, as a way to balance quality of representation and energy conservation. This is in stark contrast to sigmoid neurons, which fire at 1/2 of their maximum rate when their input is zero. A solution to this problem is to use a rectifier neuron, which does not fire at its zero value. This rectified linear unit is inspired by a common biological model of the neuron, the leaky integrate-and-fire (LIF) model, described by Dayan and Abbott<ref><br />
Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems<br />
</ref>. Its function is illustrated in the figure below (middle).<br />
<br />
<gallery mode=packed widths="280px" heights="250px"><br />
Image:sig_neuron.png|Sigmoid and TANH Neuron<br />
Image:lif_neuron.png|Leaky Integrate Fire Neuron<br />
Image:rect_neuron.png|Rectified Linear Neuron<br />
</gallery><br />
<br />
Given that the rectifier neuron has a larger range of inputs that will be output as zero, its representation will obviously be more sparse. In the paper, the two most salient advantages of sparsity are:<br />
<br />
- '''Information Disentangling''': As opposed to a dense representation, where every slight input change results in a considerable output change, the non-zero items of a sparse representation remain almost constant under slight input changes.<br />
<br />
- '''Variable Dimensionality''': A sparse representation can effectively choose how many dimensions to use to represent a variable, since it chooses how many non-zero elements contribute. Thus, the precision is variable, allowing for more efficient representation of complex items.<br />
<br />
Further benefits of a sparse representation, and of rectified linear neurons in particular, are better linear separability (because the input is represented in a higher-dimensional space) and lower computational cost (most units are off, and for the active units only a linear function has to be computed).<br />
<br />
However, it should also be noted that sparsity reduces the capacity of the model because each unit takes part in the representation of fewer values.<br />
<br />
== Advantages of rectified linear units ==<br />
<br />
The rectifier activation function <math>\,max(0, x)</math> allows a network to easily obtain sparse representations since only a subset of hidden units will have a non-zero activation value for some given input and this sparsity can be further increased through regularization methods. Therefore, the rectified linear activation function will utilize the advantages listed in the previous section for sparsity.<br />
<br />
For a given input, only a subset of hidden units in each layer will have non-zero activation values; the rest output zero and are essentially turned off. Due to the linearity of the rectified linear function, each hidden unit's activation is a linear combination of the active (non-zero) hidden units in the previous layer. Repeating this through each layer, one sees that the neural network is actually an exponentially large number of linear models that share parameters, since later layers reuse the values of earlier layers. Because the active part of the network is linear, the gradient is easy to compute and travels back through the active nodes without the vanishing gradient problem caused by non-linear sigmoid or tanh functions. In addition to the standard ReLU, three modified versions exist: Leaky, Parametric, and Randomized Leaky ReLU. <br />
<br />
The sparsity and linear model can be seen in the figure the researchers made:<br />
<br />
[[File:RLU.PNG]]<br />
<br />
Each layer is a linear combination of the previous layer.<br />
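The rectifier and the modified versions mentioned above can be written down directly in NumPy; the slope range used for the randomized variant below is illustrative, not prescribed by this paper.<br />

```python
import numpy as np

def relu(x):
    """Standard rectifier: max(0, x)."""
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    """Leaky ReLU: small fixed slope a for negative inputs."""
    return np.where(x > 0, x, a * x)

def prelu(x, a):
    """Parametric ReLU: the negative slope a is a learned parameter."""
    return np.where(x > 0, x, a * x)

def randomized_leaky_relu(x, lo=1/8, hi=1/3, rng=None):
    """Randomized Leaky ReLU: the negative slope is sampled per training
    pass (the [lo, hi] range here is illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.uniform(lo, hi)
    return np.where(x > 0, x, a * x)
```

All four agree on positive inputs and differ only in how much signal they let through for negative inputs, which is exactly what controls the sparsity of the representation.<br />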
<br />
== Potential problems of rectified linear units ==<br />
<br />
The zero derivative below zero in rectified neurons blocks the back-propagation of the gradient during learning. This effect was investigated using a smooth variant of the rectification non-linearity (the softplus activation). Surprisingly, the results suggest the hard rectification performs better. The authors hypothesize that the hard rectification is not a problem as long as the gradient can be propagated along some paths through the network, and that the complete shut-off of the hard rectification sharpens the credit attribution to neurons during learning.<br />
<br />
Furthermore, the unbounded nature of the rectification non-linearity can lead to numerical instabilities if activations grow too large. To circumvent this a <math>L_1</math> regularizer is used. Also, if symmetry is required, this can be obtained by using two rectifier units with shared parameters, but requires twice as many hidden units as a network with a symmetric activation function.<br />
<br />
Finally, rectifier networks are subject to ill conditioning of the parametrization. Biases and weights can be scaled in different (and consistent) ways while preserving the same overall network function.<br />
<br />
This paper addresses several difficulties that arise when one wants to use the rectifier activation in a stacked denoising auto-encoder. The authors experimented with several strategies to solve these problems.<br />
<br />
1. Use a softplus activation function for the reconstruction layer, along with a quadratic cost: <math> L(x, \theta) = ||x-log(1+exp(f(\tilde{x}, \theta)))||^2</math><br />
<br />
2. Scale the rectifier activation values between 0 and 1, then use a sigmoid activation function for the reconstruction layer, along with a cross-entropy reconstruction cost: <math> L(x, \theta) = -xlog(\sigma(f(\tilde{x}, \theta))) - (1-x)log(1-\sigma(f(\tilde{x}, \theta))) </math><br />
<br />
The first strategy yields better generalization on image data, and the second on text data.<br />
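The two reconstruction strategies can be sketched as loss functions; this is a minimal NumPy illustration of the formulas above, with our own names and an epsilon guard that is our addition, not the paper's.<br />

```python
import numpy as np

def softplus_quadratic_loss(x, f):
    """Strategy 1: softplus on the reconstruction layer, quadratic cost.
    x : clean input; f : pre-activation of the reconstruction layer
    computed from the corrupted input (both arrays)."""
    recon = np.log1p(np.exp(f))          # softplus(f) = log(1 + exp(f))
    return np.sum((x - recon) ** 2)

def sigmoid_cross_entropy_loss(x, f):
    """Strategy 2: inputs scaled to [0, 1], sigmoid reconstruction layer,
    cross-entropy cost."""
    s = 1.0 / (1.0 + np.exp(-f))
    eps = 1e-12                          # guard against log(0)
    return -np.sum(x * np.log(s + eps) + (1 - x) * np.log(1 - s + eps))
```

Both losses are zero (or very close to it) when the reconstruction matches the clean input, and grow as the corrupted-input reconstruction drifts away from it.<br />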
<br />
= Experiments =<br />
<br />
Networks with rectifier neurons were applied to the domains of image recognition and sentiment analysis. The datasets for image recognition included black-and-white (MNIST, NISTP), colour (CIFAR10), and stereo (NORB) images.<br />
<br />
The datasets for sentiment analysis were taken from opentable.com and Amazon. The task for both was to predict the star rating from the text of the review.<br />
<br />
== Results ==<br />
<br />
'''Results from image classification'''<br />
[[File:rectifier_res_1.png]]<br />
<br />
'''Results from sentiment classification'''<br />
[[File:rectifier_res_2.png]]<br />
<br />
For the image recognition tasks, they find that there is almost no improvement from unsupervised pre-training with rectifier activations, contrary to what is experienced using tanh or softplus; the rectifier network achieves its best performance even when trained without unsupervised pre-training.<br />
<br />
In the NORB and sentiment analysis cases, the network benefited greatly from pre-training. However, the benefit in NORB diminished as the training set size grew.<br />
<br />
The result from the Amazon dataset was 78.95%, while the state of the art was 73.72%.<br />
<br />
The sparsity achieved with the rectified linear neurons helps to diminish the gap between networks with unsupervised pre-training and no pre-training.<br />
<br />
== Discussion / Criticism ==<br />
<br />
* Rectifier neurons really aren't biologically plausible for a variety of reasons. Namely, neurons in the cortex do not have tuning curves resembling the rectifier. Additionally, the ideal sparsity of the rectifier networks was 50 to 80%, while the brain is estimated to have a sparsity of around 95 to 99%.<br />
<br />
* The sparsity encouraged by ReLU is a double-edged sword: while sparsity encourages information disentangling, efficient variable-size representation, linear separability, and increased robustness, as suggested by the authors of this paper, the authors of <ref>Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).</ref> argue that computation on sparse non-uniform data structures is very inefficient; the overhead and cache misses would make it hard to justify using sparse data structures.<br />
<br />
* ReLU does not suffer from the vanishing gradient problem.<br />
<br />
* ReLU units can be prone to "dying": a unit may output the same value (zero) regardless of its input. This occurs when a large negative bias is learned, causing the output of the ReLU to be zero; the unit then gets stuck, because the gradient at zero is zero. Techniques such as Leaky ReLU and Maxout mitigate this problem.<br />
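A tiny numeric illustration of the dying-ReLU problem described above, with hypothetical weights and bias chosen to exhibit the effect:<br />

```python
import numpy as np

def relu_grad(z):
    """Gradient of max(0, z) with respect to z (0 for z <= 0)."""
    return (z > 0).astype(float)

def leaky_relu_grad(z, a=0.01):
    """Leaky ReLU keeps a small gradient a for z <= 0."""
    return np.where(z > 0, 1.0, a)

# A unit with a large learned negative bias outputs 0 for typical inputs.
x = np.array([0.5, -0.3, 1.2])
w, b = np.ones(3), -10.0
z = w @ x + b                 # pre-activation: 1.4 - 10 = -8.6
dead = relu_grad(z)           # 0.0: no gradient flows, the unit cannot recover
leaky = leaky_relu_grad(z)    # 0.01: a small gradient still flows
```

With the hard rectifier the unit receives no gradient at all, so SGD can never move the bias back; the leaky variant keeps a small escape route open.<br />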
<br />
= Bibliography =<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Convolutional_Neural_Networks_For_LVCSR&diff=27347deep Convolutional Neural Networks For LVCSR2015-12-17T03:51:35Z<p>Rqiao: /* Conclusions and Discussions */</p>
<hr />
<div>= Introduction =<br />
Deep Neural Networks (DNNs) have been explored in the area of speech recognition. They outperformed state-of-the-art Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems in both small and large speech recognition tasks<br />
<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref><br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref><br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref><br />
<ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref><br />
<ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>.<br />
Convolutional Neural Networks (CNNs) can model temporal/spatial variations while reducing translation variance. CNNs are attractive for speech recognition for two reasons: first, they are translation invariant, which makes them an alternative to various speaker adaptation techniques; second, the spectral representation of speech has strong local correlations, which CNNs can naturally capture.<br />
<br />
CNNs have been explored in speech recognition <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>, but only one convolutional layer was used. This paper explores using multiple convolutional layers, and the system is tested on one small dataset and two large datasets. The results show that CNNs outperform DNNs in all of these tasks.<br />
<br />
= CNN Architecture =<br />
<br />
A typical CNN, as shown in Fig 1, consists of a convolutional layer, for which the weights are shared across the input space, and a max-pooling layer.<br />
<br />
<center><br />
[[File:Convnets.png |300px | thumb | center |Fig 1. A typical convolutional neural network. ]]<br />
</center><br />
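The two building blocks of Fig 1, a shared-weight ("valid") convolution and non-overlapping max pooling, can be sketched in NumPy as follows; the sizes and names are illustrative, not the paper's configuration.<br />

```python
import numpy as np

def conv2d_valid(x, kernels):
    """'Valid' 2-D convolution with weights shared across the input.
    x       : (H, W) input, e.g. log mel-filter bank features (freq x time)
    kernels : (n, kh, kw) filter bank applied over the whole input."""
    n, kh, kw = kernels.shape
    H, W = x.shape
    out = np.empty((n, H - kh + 1, W - kw + 1))
    for k in range(n):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[k, i, j] = np.sum(x[i:i+kh, j:j+kw] * kernels[k])
    return out

def max_pool(fmap, ph, pw):
    """Non-overlapping max pooling over each feature map."""
    n, H, W = fmap.shape
    out = fmap[:, :H - H % ph, :W - W % pw]   # crop to a multiple of the pool
    out = out.reshape(n, H // ph, ph, W // pw, pw)
    return out.max(axis=(2, 4))
```

The convolution detects the same local spectral pattern at every position, and the pooling then keeps only the strongest response in each neighbourhood, which is what gives the network its tolerance to small frequency and time shifts.<br />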
<br />
== Experimental Setup ==<br />
A small 40-hour dataset is used to study the behaviour of CNNs on speech tasks. Results are reported on the EARS dev04f dataset. Features are 40-dimensional log mel-filter bank coefficients. The size of the hidden fully connected layers is 1024, and the softmax layer size is 512. For fine-tuning, the learning rate is halved after each iteration for which the objective function doesn't improve sufficiently on a held-out validation set. After halving the learning rate 5 times, training stops. <br />
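The learning-rate schedule just described can be sketched as follows; `train_epoch` and `validate` are caller-supplied stand-ins for the real training loop, and `min_gain` (the "sufficient improvement" threshold) is an assumption on our part.<br />

```python
def train_with_halving(train_epoch, validate, lr, max_halvings=5, min_gain=0.0):
    """Sketch of the fine-tuning schedule: halve the learning rate after
    every iteration whose held-out objective does not improve sufficiently,
    and stop after `max_halvings` halvings. `train_epoch(lr)` runs one pass;
    `validate()` returns the held-out objective (lower is better)."""
    best = validate()
    halvings = 0
    # As described above, training ends only via the halving counter.
    while halvings < max_halvings:
        train_epoch(lr)
        score = validate()
        if best - score > min_gain:      # sufficient improvement: keep lr
            best = score
        else:
            lr /= 2.0                    # insufficient improvement: halve
            halvings += 1
    return lr
```

With this schedule, a plateau in the held-out objective drives the learning rate down geometrically until the stopping condition is reached.<br />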
<br />
== Number of Convolutional vs. Fully Connected Layers ==<br />
In image recognition tasks, a few convolutional layers are used before the fully connected layers. The convolutional layers tend to reduce spectral variation, while the fully connected layers use the local information learned by the convolutional layers to do classification. In this work, and unlike what had been explored before for speech recognition tasks <ref name=convDNN></ref>, multiple convolutional layers are used, followed by fully connected layers as in the image recognition framework. The following table shows the word error rate (WER) for different numbers of convolutional and fully connected layers.<br />
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of convolutional and fully-connected layers.<br />
! Number of convolutional and fully-connected layers<br />
! WER<br />
|-<br />
| No conv, 6 full<br />
| 24.8<br />
|-<br />
| 1 conv, 5 full<br />
| 23.5<br />
|-<br />
| 2 conv, 4 full<br />
| 22.1<br />
|-<br />
| 3 conv, 3 full<br />
| 22.4<br />
|-<br />
|}<br />
<br />
== Number of Hidden Units ==<br />
Speech differs from images in that different frequency regions have different characteristics; hence Abdel-Hamid et al. <ref name=convDNN></ref> proposed weight sharing across nearby frequencies only. Although this addresses the problem, it limits adding multiple convolutional layers. In this work, weight sharing is done across the entire frequency space, while using more filters than is typical in vision to capture the differences between the low and high frequencies.<br />
The following table shows the WER for different numbers of hidden units in the convolutional layers, for the 2 convolutional and 4 fully-connected configuration. The total number of parameters of the network is kept constant for a fair comparison.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the number of hidden units.<br />
! Number of hidden units<br />
! WER<br />
|-<br />
| 64<br />
| 24.1<br />
|-<br />
| 128<br />
| 23.0<br />
|-<br />
| 220<br />
| 22.1<br />
|-<br />
| 128/256<br />
| 21.9<br />
|-<br />
|}<br />
<br />
A slight improvement is obtained by using 128 hidden units for the first convolutional layer and 256 for the second: more hidden units in the convolutional layers help capture the locality differences between different frequency regions in speech.<br />
<br />
== Optimal Feature Set ==<br />
Note that Linear Discriminant Analysis (LDA) features cannot be used with CNNs because the LDA transform removes local correlation in frequency. Mel filter-bank (FB) features, which preserve this locality, are used instead.<br />
<br />
The following features are used to build the table below; WER is used to decide the best feature set.<br />
# Vocal Tract Length Normalization (VTLN) warping, to help map features into a canonical space.<br />
# Feature-space Maximum Likelihood Linear Regression (fMLLR).<br />
# Delta (d), the difference between features in consecutive frames, and double delta (dd).<br />
# Energy feature.<br />
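As a rough sketch of item 3, deltas can be computed as consecutive-frame differences and double deltas as differences of the deltas (real systems typically use a regression over several frames; the zero-padding of the first frame is an illustrative choice):<br />

```python
def add_deltas(frames):
    """Append delta (d) and double-delta (dd) coefficients to each frame,
    using the simple consecutive-frame difference described above."""
    dim = len(frames[0])
    d = [[0.0] * dim] + [[a - b for a, b in zip(cur, prev)]
                         for prev, cur in zip(frames, frames[1:])]
    dd = [[0.0] * dim] + [[a - b for a, b in zip(cur, prev)]
                          for prev, cur in zip(d, d[1:])]
    # Each output frame is [static, d, dd] concatenated.
    return [f + x + y for f, x, y in zip(frames, d, dd)]

# Three toy 2-dimensional "filter-bank" frames.
out = add_deltas([[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]])
```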
<br />
{| class="wikitable"<br />
|+ Word error rate as a function of input features.<br />
! Feature<br />
! WER<br />
|-<br />
| Mel FB<br />
| 21.9<br />
|-<br />
| VTLN-warped mel FB<br />
| 21.3<br />
|-<br />
| VTLN-warped mel FB + fMLLR<br />
| 21.2<br />
|-<br />
| VTLN-warped mel FB + d + dd<br />
| 20.7<br />
|-<br />
| VTLN-warped mel FB + d + dd + energy<br />
| 21.0<br />
|-<br />
|}<br />
<br />
== Pooling Experiments ==<br />
Pooling helps reduce spectral variance in the input features. Pooling is done only along the frequency axis, which was shown to work better for speech <ref name=convDNN></ref>. The word error rate is measured on two datasets with different sampling rates (8 kHz Switchboard telephone conversations, SWB, and 16 kHz English Broadcast News, BN), and a pooling size of 3 is found to be optimal.<br />
{| class="wikitable"<br />
|+ Word error rate as a function of the pooling size.<br />
! Pooling size<br />
! WER-SWB<br />
! WER-BN<br />
|-<br />
| No pooling<br />
| 23.7<br />
| '-'<br />
|-<br />
| pool=2<br />
| 23.4<br />
| 20.7<br />
|-<br />
| pool=3<br />
| 22.9<br />
| 20.7<br />
|-<br />
| pool=4<br />
| 22.9<br />
| 21.4<br />
|-<br />
|}<br />
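A minimal sketch of the frequency-only pooling examined above, assuming non-overlapping windows (the overlap is not specified here, so that is an illustrative choice):<br />

```python
def pool_frequency(fmap, pool=3):
    """Max-pool each time frame along the frequency axis only, using
    non-overlapping windows of `pool` bands; time frames are untouched."""
    return [[max(frame[i:i + pool]) for i in range(0, len(frame), pool)]
            for frame in fmap]

# Two time frames with 6 frequency bands each, pooled down to 2 bands each.
pooled = pool_frequency([[1, 5, 2, 0, 3, 4],
                         [9, 1, 1, 7, 2, 2]], pool=3)
```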
<br />
= Results with Proposed Architecture =<br />
<br />
The optimal architecture described in the previous section is used in the experiments. A 50-hour English Broadcast News (BN) dataset is used for training, and the EARS dev04f and rt04 datasets are used for testing. Five different systems are compared, as shown in the following table. In the hybrid approach, the DNN or CNN produces the likelihood probabilities for the HMM, while in the CNN/DNN-based feature systems the CNN or DNN produces features that are then used by a GMM/HMM system. The hybrid CNN offers about 15% relative improvement over the GMM/HMM system and 3-5% relative improvement over the hybrid DNN. The CNN-based features offer 5-6% relative improvement over the DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER for NN Hybrid and Feature-Based Systems.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 18.8<br />
| 18.1<br />
|-<br />
| Hybrid DNN<br />
| 16.3<br />
| 15.8<br />
|-<br />
| DNN-based features<br />
| 16.7<br />
| 16.0<br />
|-<br />
| Hybrid CNN<br />
| 15.8<br />
| 15.0<br />
|-<br />
| CNN-based features<br />
| 15.2<br />
| 15.0<br />
|-<br />
|}<br />
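The relative improvements quoted above follow the usual definition (baseline − new) / baseline; as a quick sanity check against the table (the helper function is mine, the WER values are from the table):<br />

```python
def rel_improvement(baseline, new):
    """Relative WER reduction in percent: (baseline - new) / baseline."""
    return 100.0 * (baseline - new) / baseline

# Hybrid CNN vs. hybrid DNN on dev04f and rt04 (WERs from the table).
dnn_vs_cnn = [round(rel_improvement(16.3, 15.8), 1),
              round(rel_improvement(15.8, 15.0), 1)]
```

For example, the hybrid CNN improves on the hybrid DNN by about 3.1% on dev04f and 5.1% on rt04, consistent with the 3-5% range quoted above.<br />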
<br />
= Results on Large Tasks =<br />
<br />
After tuning the CNN configuration on a small dataset, the CNN-based features system is tested on two larger datasets.<br />
<br />
== Broadcast News ==<br />
Broadcast News consists of 400 hours of speech data, which was used for training. The DARPA EARS rt04 and dev04f datasets were used for testing. The following table shows that CNN-based features offer 13-18% relative improvement over the GMM/HMM system and 10-12% over DNN-based features.<br />
<br />
{| class="wikitable"<br />
|+ WER on Broadcast News, 400 hrs.<br />
! Model<br />
! dev04f<br />
! rt04<br />
|-<br />
| Baseline GMM/HMM <br />
| 16.0<br />
| 13.8<br />
|-<br />
| Hybrid DNN<br />
| 15.1<br />
| 13.4<br />
|-<br />
| DNN-based features<br />
| 14.9<br />
| 13.4<br />
|-<br />
| CNN-based features<br />
| 13.1<br />
| 12.0<br />
|-<br />
|}<br />
<br />
== Switchboard ==<br />
<br />
The Switchboard dataset consists of 300 hours of conversational American English telephone data. The Hub5'00 dataset is used as the validation set, while the rt03 set is used for testing. Switchboard (SWB) and Fisher (FSH) are portions of the set, and results are reported separately for each. Three systems, shown in the following table, were compared. CNN-based features offer a 13-33% relative improvement over the GMM/HMM system and a 4-7% relative improvement over the hybrid DNN system. These results show that CNNs are superior to both GMMs and DNNs.<br />
{| class="wikitable"<br />
|+ WER on Switchboard, 300 hrs.<br />
! Model<br />
! Hub5’00 SWB<br />
! rt03 FSH<br />
! rt03 SWB<br />
|-<br />
| Baseline GMM/HMM <br />
| 14.5<br />
| 17.0<br />
| 25.2<br />
|-<br />
| Hybrid DNN<br />
| 12.2<br />
| 14.9<br />
| 23.5<br />
|-<br />
| CNN-based features<br />
| 11.5<br />
| 14.3<br />
| 21.9<br />
|-<br />
|}<br />
<br />
= Conclusions and Discussions =<br />
<br />
This paper demonstrates that CNNs perform well for LVCSR and shows that multiple convolutional layers give even more improvement when the convolutional layers have a large number of feature maps. CNNs were shown to be superior to both GMMs and DNNs on a small speech recognition task. CNNs were then used to produce features for the GMMs; this system was tested on larger datasets and outperformed both the GMM- and DNN-based systems. The Mel filter-bank is regarded as a suitable feature for the CNN since it exhibits the locality property.<br />
In fact, CNNs are able to capture translational invariance across different speakers by replicating weights in the time and frequency domains, and they can model the local correlations of speech.<br />
<br />
The authors conclude that having 2 convolutional and 4 fully connected layers is optimal for CNNs, but the table above shows that the result for 2 convolutional and 4 fully connected layers is close to that for 3 convolutional and 3 fully connected layers. More experiments may be needed before this conclusion can be drawn with statistical confidence.<br />
<br />
<br />
The authors set up the experiments without clarifying the following:<br />
# The hybrid CNN was not tested on the larger datasets; the authors give no reason, but it might be due to scalability issues.<br />
# They did not compare against the CNN system proposed by Abdel-Hamid et al. <ref name=convDNN></ref>.<br />
<br />
= References =<br />
<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=very_Deep_Convoloutional_Networks_for_Large-Scale_Image_Recognition&diff=27345very Deep Convoloutional Networks for Large-Scale Image Recognition2015-12-17T03:27:53Z<p>Rqiao: /* Training */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper<ref><br />
Simonyan, Karen, and Andrew Zisserman. [http://arxiv.org/pdf/1409.1556.pdf "Very deep convolutional networks for large-scale image recognition."] arXiv preprint arXiv:1409.1556 (2014).</ref> the effect of convolutional network depth on accuracy in the large-scale image recognition setting is investigated. It is demonstrated that representation depth is beneficial for<br />
classification accuracy, and the main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters. The authors fix the other parameters of the architecture and steadily increase the depth by adding more convolutional layers, which is feasible due to the use of very small (3×3) filters in all layers. As a result, they arrive at significantly more accurate ConvNet architectures.<br />
<br />
= Conv.Net Configurations =<br />
<br />
Architecture:<br />
<br />
During training, the only preprocessing step is to subtract the mean RGB value computed on the training data. The image is then passed through a stack of convolutional (conv.) layers with filters with a very small receptive field: 3 × 3, with a convolutional stride of 1 pixel. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers. Max-pooling is performed over a 2 × 2 pixel window, with stride 2. The stack of convolutional layers (whose depth differs between architectures) is followed by three Fully-Connected (FC) layers. The final layer is the soft-max layer, and all hidden layers are equipped with the rectification non-linearity.<br />
<br />
They don't implement Local Response Normalization (LRN) as they found such normalization does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.<br />
<br />
Configuration:<br />
<br />
The ConvNet configurations, evaluated in this paper, are outlined in the following table:<br />
<br />
<br />
[[File:4.PNG | center]]<br />
<br />
<br />
All configurations follow the aforementioned architecture and differ only in the depth from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers) (the added layers are shown in bold). Besides, the width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.<br />
<br />
As stated in the table, multiple convolutional layers with small filters are used without any max-pooling layer between them. A stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5, but using a stack of two or three conv. layers has two main advantages:<br />
1) Two or three non-linear rectification layers are incorporated instead of a single one, which makes the decision function more discriminative.<br />
2) The number of parameters is decreased.<br />
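Both advantages can be verified with a small calculation: each stride-1 3×3 layer grows the effective receptive field by 2, and a stack of two 3×3 layers with C input/output channels has 2·3²·C² = 18C² weights versus 25C² for a single 5×5 layer (biases ignored for simplicity; C = 64 is chosen for illustration):<br />

```python
def receptive_field(num_layers, kernel=3):
    """Effective receptive field of a stack of stride-1 conv layers."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1          # each layer adds (kernel - 1) pixels
    return rf

def stack_params(num_layers, kernel, channels):
    """Weights in a stack of conv layers with `channels` maps in and out."""
    return num_layers * kernel * kernel * channels * channels

C = 64                            # channel count, chosen for illustration
two_3x3 = stack_params(2, 3, C)   # 2 * 9 * C^2 = 18 C^2 weights
one_5x5 = stack_params(1, 5, C)   # 25 C^2 weights, same 5x5 receptive field
```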
<br />
In the meantime, since a 1×1 convolution is essentially a linear projection onto a space of the same dimensionality, incorporating 1 × 1 conv. layers (configuration C) is a way to increase the non-linearity of the decision function, thanks to the rectification function, without affecting the receptive fields of the conv. layers.<br />
<br />
= Classification Framework =<br />
<br />
In this section, the details of classification ConvNet training and evaluation are described.<br />
<br />
===Training===<br />
<br />
Training is carried out by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum. Initial weights for some layers were obtained from configuration “A” which is shallow enough to be trained with random initialization. The intermediate layers in deep models were initialized randomly.<br />
In spite of the larger number of parameters and the greater depth of the introduced nets, these nets required fewer epochs to converge, for the following reasons:<br />
(a) implicit regularization imposed by greater depth and smaller conv. filter sizes.<br />
(b) using pre-initialization of certain layers.<br />
<br />
With respect to (b) above, the shallowest configuration (A in the previous table) was trained using random initialization. For all the other configurations, the first four convolutional layers and the last three fully connected layers were initialized with the corresponding parameters from A, to avoid getting stuck during training due to a bad initialization. All other layers were randomly initialized by sampling from a normal distribution with 0 mean. The authors also mention that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio<ref><br />
Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." International conference on artificial intelligence and statistics. 2010.<br />
</ref><br />
<br />
During training, the input to the ConvNets is a fixed-size 224 × 224 RGB image. To obtain this fixed-size image, the training images are rescaled and cropped (one crop per image per SGD iteration). To rescale an input image, a training scale S, defined as the smallest side of the isotropically-rescaled training image, must be chosen.<br />
Two approaches for setting the training scale S are considered:<br />
1) single-scale training, which requires a fixed S; <br />
2) multi-scale training, where each training image is individually rescaled by randomly sampling S from a range [Smin, Smax].<br />
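The rescale-then-crop step can be sketched as follows (shape bookkeeping only, no actual image resampling; the uniform sampling of S and the [256, 512] default range are illustrative assumptions):<br />

```python
import random

def sample_crop(height, width, s_min=256, s_max=512, crop=224):
    """Pick a random training scale S, rescale so the smaller image side
    equals S, then pick a random offset for a 224x224 crop."""
    s = random.randint(s_min, s_max)          # multi-scale jittering
    scale = s / min(height, width)
    h, w = round(height * scale), round(width * scale)
    top = random.randint(0, h - crop)         # random crop position
    left = random.randint(0, w - crop)
    return (h, w), (top, left)

random.seed(0)                                # reproducible illustration
(h, w), (top, left) = sample_crop(480, 640)
```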
<br />
===Implementation===<br />
<br />
To improve the overall training speed of each model, the researchers parallelized the mini-batch gradient descent process. Since the model is very deep, training on a single GPU would take months to finish. To speed up the process, separate batches of images were trained on each GPU in parallel to calculate the gradients: for example, with 4 GPUs, the model takes 4 batches of images, calculates their gradients separately, and finally averages the four sets of gradients for the update. (Krizhevsky et al., 2012) introduced more complicated ways to parallelize the training of convolutional neural networks, but the researchers found that this simple configuration sped up training by a factor of 3.75 on 4 GPUs, close to the theoretical maximum of 4, which worked well enough. <br />
Finally, it took 2–3 weeks to train a single net by using four NVIDIA Titan Black GPUs.<br />
<br />
===Testing===<br />
<br />
At test time, the input image is classified as follows.<br />
First, it is isotropically rescaled to a pre-defined smallest image side, denoted Q. <br />
Then the network is applied densely over the rescaled test image: the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).<br />
The resulting fully-convolutional net is then applied to the whole (uncropped) image.<br />
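The FC-to-conv conversion works because a fully-connected unit over a K × K patch computes the same dot product as a K × K convolution filter evaluated at a single position; a toy single-channel, single-unit sketch (sizes assumed for illustration):<br />

```python
def fc_as_conv(patch, weights):
    """A fully-connected unit over a KxK patch and a KxK conv filter at a
    single position compute the same dot product."""
    return sum(p * w for row_p, row_w in zip(patch, weights)
                     for p, w in zip(row_p, row_w))

patch   = [[1.0, 2.0], [3.0, 4.0]]    # toy 2x2 single-channel input
weights = [[0.5, -1.0], [0.25, 2.0]]  # the shared FC / conv weights
y = fc_as_conv(patch, weights)
```

Reshaped as convolutions, the same weights can then slide over the whole uncropped image, producing a spatial map of class scores instead of a single vector.<br />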
<br />
= Classification Experiments =<br />
In this section, the image classification results on the ILSVRC-2012 dataset are described:<br />
<br />
== Single-Scale Evaluation ==<br />
<br />
In the first part of the experiment, the test image size was set to Q = S for fixed S, and Q = 0.5(Smin + Smax) for jittered S. One important result of this evaluation is that the classification error decreases with increased ConvNet depth.<br />
Moreover, the worse performance of the configuration with 1×1 filters (C) compared with the one with 3×3 filters (D) indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).<br />
Finally, scale jittering at training time leads to significantly better results than training on images with a fixed smallest side. This confirms that training-set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.<br />
<br />
[[File:ConvNet1.PNG | center]]<br />
<br />
== Multi-Scale Evaluation ==<br />
<br />
In addition to the single-scale evaluation described in the previous section, the effect of scale jittering at test time is assessed by running a model over several rescaled versions of a test image (corresponding to different values of Q) and averaging the resulting class posteriors. The results indicate that scale jittering at test time leads to better performance than evaluating the same model at a single scale.<br />
<br />
Their best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error. On the test set, the configuration E achieves 7.3% top-5 error.<br />
<br />
[[File:ConvNet2.PNG | center]]<br />
<br />
== Comparison With The State Of The Art ==<br />
<br />
Their very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions.<br />
<br />
[[File:ConvNet3.PNG | center]]<br />
<br />
= Appendix A: Localization =<br />
<br />
In addition to classification, the introduced architectures have been used for localization. To perform object localisation, a very deep ConvNet is used in which the last fully connected layer predicts the bounding box location instead of the class scores. Apart from the last bounding-box prediction layer, the ConvNet architecture D, found to be the best-performing in the classification task, is used, and the training of the localisation ConvNets is similar to that of the classification ConvNets. The main difference is that the logistic regression objective is replaced with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground truth.<br />
Two testing protocols are considered:<br />
The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class. (The bounding box is obtained by applying the network only to the central crop of the image.)<br />
The second, fully-fledged, testing procedure is based on the dense application of the localization ConvNet to the whole image, similarly to the classification task.<br />
<br />
the localization experiments indicate that performance advancement brought by the introduced very deep ConvNets produces considerably better results with a simpler localization method, but a more powerful representation.<br />
<br />
= Conclusion =<br />
<br />
Very deep ConvNets are introduced in this paper. The results show that the configuration has good performance on classification and localization and significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. Details and more results on these competitions can be found here.<ref><br />
Russakovsky, Olga, et al. [http://arxiv.org/pdf/1409.0575v3.pdf "Imagenet large scale visual recognition challenge."] International Journal of Computer Vision (2014): 1-42.<br />
</ref> They also showed that their configuration is applicable to some other datasets.<br />
<br />
= Resources =<br />
<br />
The Oxford Visual Geometry Group (VGG) has released code for their 16-layer and 19-layer models. The code is available on their [http://www.robots.ox.ac.uk/~vgg/research/very_deep/ website] in the format used by the [http://caffe.berkeleyvision.org/ Caffe] toolbox and includes the weights of the pretrained networks.<br />
<br />
=References=<br />
<references /><br />
<br />
Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.</div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=very_Deep_Convoloutional_Networks_for_Large-Scale_Image_Recognition&diff=27344very Deep Convoloutional Networks for Large-Scale Image Recognition2015-12-17T03:27:32Z<p>Rqiao: /* Training */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper<ref><br />
Simonyan, Karen, and Andrew Zisserman. [http://arxiv.org/pdf/1409.1556.pdf "Very deep convolutional networks for large-scale image recognition."] arXiv preprint arXiv:1409.1556 (2014).</ref> the effect of convolutional network depth on accuracy in the large-scale image recognition setting is investigated. It is demonstrated that representation depth is beneficial for the<br />
classification accuracy, and the main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters. The authors fix the other parameters of the architecture and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers. As a result, they arrive at significantly more accurate ConvNet architectures.<br />
<br />
= Conv.Net Configurations =<br />
<br />
Architecture:<br />
<br />
During training, the only preprocessing step is to subtract the mean RGB value computed on the training data. Then, the image is passed through a stack of convolutional (conv.) layers with filters with a very small receptive field: 3 × 3, with a convolutional stride of 1 pixel. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers. Max-pooling is performed over a 2 × 2 pixel window, with stride 2. A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers. The final layer is the soft-max layer, and all hidden layers are equipped with the rectification non-linearity.<br />
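To make the arithmetic of this stack concrete, the following sketch (not from the paper's code) tracks the spatial size of a 224×224 input through the conv/pool stack of the deepest-per-block layout, assuming the 1-pixel padding the paper uses for its 3×3 convolutions:

```python
# Spatial size through 3x3 stride-1 convolutions (1-pixel padding keeps the
# size unchanged) and 2x2 stride-2 max-pooling (halves it).

def conv3x3(size, pad=1):
    return (size + 2 * pad - 3) // 1 + 1

def maxpool2x2(size):
    return (size - 2) // 2 + 1

size = 224
for convs_in_block in [2, 2, 3, 3, 3]:   # conv layers per block (configuration D)
    for _ in range(convs_in_block):
        size = conv3x3(size)
    size = maxpool2x2(size)

print(size)  # 7 -- the 7x7 map consumed by the first fully-connected layer
```

This is why the first FC layer sees a 7×7 spatial map regardless of which configuration (A–E) is used: the pooling schedule, not the conv depth, sets the output resolution.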
<br />
They don't implement Local Response Normalization (LRN) as they found such normalization does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.<br />
<br />
Configuration:<br />
<br />
The ConvNet configurations evaluated in this paper are outlined in the following table:<br />
<br />
<br />
[[File:4.PNG | center]]<br />
<br />
<br />
All configurations follow the aforementioned architecture and differ only in depth, from 11 weight layers in network A (8 conv. and 3 FC layers) to 19 weight layers in network E (16 conv. and 3 FC layers); the added layers are shown in bold. In addition, the width of the conv. layers (the number of channels) is rather small, starting at 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.<br />
<br />
As stated in the table, multiple convolutional layers with small filters are used without any max-pooling layer between them. A stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5, and a stack of three such layers has an effective receptive field of 7×7. Using a stack of two or three small-filter conv. layers instead of a single layer with a larger filter has two main advantages:<br />
1) Two/three non-linear rectification layers are incorporated instead of a single one, which makes the decision function more discriminative.<br />
2) The number of parameters is decreased: for layers with C channels, a three-layer 3×3 stack has 3(3²C²) = 27C² weights, whereas a single 7×7 layer would require 7²C² = 49C².<br />
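Both advantages can be checked numerically. The sketch below computes the effective receptive field of a stack of stride-1 3×3 layers and compares weight counts against a single large filter (biases ignored, C input and C output channels):

```python
def receptive_field(n_layers, k=3):
    # each additional stride-1 kxk layer grows the field by k-1
    return (k - 1) * n_layers + 1

def conv_params(k, channels, n_layers=1):
    # weights only: n_layers * (k*k*C*C)
    return n_layers * k * k * channels * channels

C = 64
print(receptive_field(2))                 # 5: two 3x3 layers see a 5x5 region
print(receptive_field(3))                 # 7: three 3x3 layers see 7x7
print(conv_params(3, C, n_layers=3))      # 27*C^2 = 110592
print(conv_params(7, C))                  # 49*C^2 = 200704
```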
<br />
Meanwhile, since a 1×1 convolution is essentially a linear projection onto a space of the same dimensionality, the incorporation of 1 × 1 conv. layers (configuration C) is a way to increase the non-linearity of the decision function, through the rectification function that follows each layer, without affecting the receptive fields of the conv. layers.<br />
<br />
= Classification Framework =<br />
<br />
In this section, the details of classification ConvNet training and evaluation are described.<br />
<br />
===Training===<br />
<br />
Training is carried out by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum. Initial weights for some layers were obtained from configuration “A”, which is shallow enough to be trained with random initialization; the intermediate layers in the deeper models were initialized randomly.<br />
In spite of the larger number of parameters and the greater depth of the introduced nets, these nets required fewer epochs to converge for two reasons:<br />
(a) implicit regularization imposed by greater depth and smaller conv. filter sizes;<br />
(b) pre-initialization of certain layers.<br />
<br />
With respect to (b) above, the shallowest configuration (A in the previous table) was trained using random initialization. For all the other configurations, the first four convolutional layers and the last three fully connected layers were initialized with the corresponding parameters from A, to avoid getting stuck during training due to a bad initialization. All other layers were randomly initialized by sampling from a zero-mean normal distribution. The authors also mention that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio<ref><br />
Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." International conference on artificial intelligence and statistics. 2010.<br />
</ref><br />
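For reference, the Glorot & Bengio scheme cited above draws weights uniformly from [-a, a] with a = sqrt(6 / (fan_in + fan_out)); a minimal sketch (function name is illustrative):

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng=random.Random(0)):
    a = math.sqrt(6.0 / (fan_in + fan_out))   # bound from Glorot & Bengio (2010)
    return [[rng.uniform(-a, a) for _ in range(fan_in)]
            for _ in range(fan_out)]

W = glorot_uniform(512, 256)
bound = math.sqrt(6.0 / (512 + 256))
print(all(-bound <= w <= bound for row in W for w in row))  # True
```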
<br />
During training, the input to the ConvNets is a fixed-size 224 × 224 RGB image. To obtain this fixed-size input, each training image is rescaled and then randomly cropped (one crop per image per SGD iteration). In order to rescale the input image, a training scale, from which the ConvNet input is cropped, must be determined.<br />
Two approaches for setting the training scale S (the smallest side of an isotropically rescaled training image) are considered:<br />
1) single-scale training, which requires a fixed S;<br />
2) multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax].<br />
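A sketch of the multi-scale option (the paper's range was S ∈ [256, 512]; the function name is illustrative, not from the paper's code):

```python
import random

def jittered_size(width, height, s_min=256, s_max=512, rng=random.Random(0)):
    s = rng.randint(s_min, s_max)        # training scale S for this image
    factor = s / min(width, height)      # isotropic rescale: smallest side -> S
    return round(width * factor), round(height * factor)

w, h = jittered_size(500, 375)           # e.g. a landscape training image
print(min(w, h))                          # equals the sampled S, in [256, 512]
# a random 224x224 crop of the rescaled image then becomes the ConvNet input
```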
<br />
===Implementation===<br />
<br />
To improve the overall training speed of each model, the researchers introduced parallelization into the mini-batch gradient descent process. Since the model is very deep, training on a single GPU would take months to finish. To speed up the process, the researchers trained separate batches of images on each GPU in parallel to calculate the gradients. For example, with 4 GPUs, the model would take 4 batches of images, calculate their gradients separately, and then take the average of the four sets of gradients as the update. (Krizhevsky et al., 2012) introduced more complicated ways to parallelize the training of convolutional neural networks, but the researchers found that this simple scheme already sped up training by a factor of 3.75 on 4 GPUs (out of a theoretical maximum of 4), so it worked well enough. <br />
Finally, it took 2–3 weeks to train a single net using four NVIDIA Titan Black GPUs.<br />
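The gradient-averaging scheme described above amounts to the following (a plain-Python stand-in for what each GPU would compute; toy numbers):

```python
def average_gradients(per_gpu_grads):
    # per_gpu_grads: one flat gradient vector per GPU, all the same length
    n_gpus = len(per_gpu_grads)
    n_params = len(per_gpu_grads[0])
    return [sum(g[i] for g in per_gpu_grads) / n_gpus for i in range(n_params)]

# 4 "GPUs", each with the gradient from its own batch split
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(average_gradients(grads))  # [4.0, 5.0]
```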
<br />
===Testing===<br />
<br />
At test time, in order to classify an input image:<br />
First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q. <br />
Then, the fully-connected layers are converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).<br />
The resulting fully-convolutional net is then applied densely over the whole (uncropped) rescaled image.<br />
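The FC-to-conv conversion is only a reshape of the weights. The sketch below checks the bookkeeping for the first FC layer, assuming the standard VGG sizes (512 channels on a 7×7 map, 4096 outputs):

```python
# FC layer: weight matrix of shape (4096, 512*7*7)
fc_shape = (4096, 512 * 7 * 7)
# Equivalent conv layer: 4096 filters of shape (512, 7, 7)
conv_shape = (4096, 512, 7, 7)

def n_elements(shape):
    n = 1
    for d in shape:
        n *= d
    return n

# Same weights, merely viewed differently; applied to a larger image, the
# conv form produces a spatial map of class-score vectors instead of one.
print(n_elements(fc_shape) == n_elements(conv_shape))  # True
```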
<br />
= Classification Experiments =<br />
In this section, the image classification results on the ILSVRC-2012 dataset are described:<br />
<br />
== Single-Scale Evaluation ==<br />
<br />
In the first part of the experiment, the test image size was set to Q = S for fixed-scale training, and Q = 0.5(Smin + Smax) for jittered-scale training. One important result of this evaluation was that the classification error decreases with increased ConvNet depth.<br />
Moreover, the worse performance of the configuration with 1×1 filters (C) in comparison with the one with 3×3 filters (D) indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).<br />
Finally, scale jittering at training time leads to significantly better results than training on images with a fixed smallest side. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.<br />
<br />
[[File:ConvNet1.PNG | center]]<br />
<br />
== Multi-Scale Evaluation ==<br />
<br />
In addition to single scale evaluation stated in the previous section, in this paper, the effect of scale jittering at test time is assessed by running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. The results indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale).<br />
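Concretely, the multi-scale evaluation just averages the per-scale class posteriors; a sketch (the numbers are toy values, not actual model outputs):

```python
def average_posteriors(per_scale):
    n_scales = len(per_scale)
    n_classes = len(per_scale[0])
    return [sum(p[c] for p in per_scale) / n_scales for c in range(n_classes)]

posteriors = [
    [0.70, 0.20, 0.10],   # evaluated at Q = 256
    [0.60, 0.30, 0.10],   # evaluated at Q = 384
    [0.80, 0.10, 0.10],   # evaluated at Q = 512
]
avg = average_posteriors(posteriors)
print(avg.index(max(avg)))  # 0 -- the predicted class after averaging
```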
<br />
Their best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error. On the test set, the configuration E achieves 7.3% top-5 error.<br />
<br />
[[File:ConvNet2.PNG | center]]<br />
<br />
== Comparison With The State Of The Art ==<br />
<br />
Their very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions.<br />
<br />
[[File:ConvNet3.PNG | center]]<br />
<br />
= Appendix A: Localization =<br />
<br />
In addition to classification, the introduced architectures have been used for localisation. To perform object localisation, a very deep ConvNet is used in which the last fully connected layer predicts the bounding box location instead of the class scores. Apart from this last bounding-box prediction layer, the ConvNet architecture is D, which was found to be the best-performing in the classification task, and the training of localisation ConvNets is similar to that of the classification ConvNets. The main difference is that the logistic regression objective is replaced with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground truth.<br />
Two testing protocols are considered:<br />
The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class. (The bounding box is obtained by applying the network only to the central crop of the image.)<br />
The second, fully-fledged, testing procedure is based on the dense application of the localization ConvNet to the whole image, similarly to the classification task.<br />
<br />
The localization experiments indicate that the introduced very deep ConvNets produce considerably better results than previous approaches despite using a simpler localization method, thanks to a more powerful representation.<br />
<br />
= Conclusion =<br />
<br />
Very deep ConvNets are introduced in this paper. The results show that these configurations perform well on classification and localization and significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. Details and more results on these competitions can be found in the ILSVRC overview paper.<ref><br />
Russakovsky, Olga, et al. [http://arxiv.org/pdf/1409.0575v3.pdf "Imagenet large scale visual recognition challenge."] International Journal of Computer Vision (2014): 1-42.<br />
</ref> They also showed that their configuration is applicable to some other datasets.<br />
<br />
= Resources =<br />
<br />
The Oxford Visual Geometry Group (VGG) has released code for their 16-layer and 19-layer models. The code is available on their [http://www.robots.ox.ac.uk/~vgg/research/very_deep/ website] in the format used by the [http://caffe.berkeleyvision.org/ Caffe] toolbox and includes the weights of the pretrained networks.<br />
<br />
=References=<br />
<references /><br />
<br />
Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.</div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Learning_of_the_tissue-regulated_splicing_code&diff=27038deep Learning of the tissue-regulated splicing code2015-12-01T19:21:33Z<p>Rqiao: /* Training the model */</p>
<hr />
<div>= Introduction =<br />
<br />
Alternative splicing (AS) is a regulated process during gene expression that enables the same gene to give rise to splicing isoforms containing different combinations of exons, which leads to different protein products. Furthermore, AS is often tissue dependent. This paper mainly focuses on applying a Deep Neural Network (DNN) to predict the outcome of splicing, and compares its performance to two previously trained models: a Bayesian Neural Network<ref>https://www.cs.cmu.edu/afs/cs/academic/class/15782-f06/slides/bayesian.pdf</ref> (BNN) and Multinomial Logistic Regression<ref>https://en.wikipedia.org/wiki/Multinomial_logistic_regression</ref> (MLR). <br />
<br />
A major difference the authors introduced in the DNN is that each tissue type is treated as an input; in the previous BNN, each tissue type was considered a different output of the neural network. Moreover, in previous work, the splicing code inferred only the direction of change of the percentage of transcripts with an exon spliced in (PSI). This paper performs absolute PSI prediction for each tissue individually, without averaging across tissues, and also predicts the difference in PSI (<math>\Delta</math>PSI) between pairs of tissues. Unlike a standard single-task deep neural network, this model trains these two prediction tasks simultaneously.<br />
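For intuition, PSI and ΔPSI can be sketched from inclusion/exclusion read counts (a simplified estimator; the paper's actual pipeline derives PSI from RNA-Seq junction reads):

```python
def psi(inclusion_reads, exclusion_reads):
    # fraction of transcripts with the exon spliced in (simplified)
    return inclusion_reads / (inclusion_reads + exclusion_reads)

psi_brain = psi(80, 20)    # exon mostly included in brain (toy counts)
psi_heart = psi(30, 70)    # mostly excluded in heart
delta_psi = psi_brain - psi_heart
print(round(psi_brain, 3), round(delta_psi, 3))  # 0.8 0.5
```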
<br />
= Model =<br />
<br />
The dataset consists of 11019 mouse alternative exons profiled from RNA-Seq<ref>https://en.wikipedia.org/wiki/RNA-Seq</ref> Data. Five tissue types are available, including brain, heart, kidney, liver and testis. <br />
<br />
The DNN is fully connected, with multiple layers of non-linearity consisting of hidden units. The mathematical expression of model is below:<br />
<br />
::::::: <math>a_v^l = f\left(\sum_{m=1}^{M^{l-1}}{\theta_{v,m}^{l}a_m^{l-1}}\right)</math> <br />
:::::::where <math>a_v^l</math> is the activation of hidden unit <math>v</math> in layer <math>l</math>, computed from the weighted sum of the outputs of the previous layer, and <math>\theta_{v,m}^{l}</math> are the weights between layers. <br />
<br />
::::::: <math>f_{RELU}(z)=max(0,z)</math><br />
::::::: The RELU unit was used for all hidden units except for the first hidden layer, which uses TANH units.<br />
<br />
::::::: <math>h_k=\frac{exp(\sum_m{\theta_{k,m}^{last}a_m^{last}})}{\sum_{k'}{exp(\sum_{m}{\theta_{k',m}^{last}a_m^{last}})}}</math><br />
::::::: this is the softmax function of the last layer. <br />
<br />
The cost function we want to minimize here during training is <math>E=-\sum_{n}\sum_{k=1}^{C}{y_{n,k}\log(h_{n,k})}</math>, where <math>n</math> denotes the training example, and <math>k</math> indexes the <math>C</math> classes. <br />
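Putting the three formulas together, a toy forward pass (tanh first layer, ReLU hidden layer, softmax output) and the cross-entropy cost look like this (sizes and weights are illustrative, not the paper's):

```python
import math
import random

def layer(x, W, f):
    # a_v = f(sum_m theta_{v,m} * a_m), one row of W per output unit
    return [f(sum(w * a for w, a in zip(row, x))) for row in W]

def relu(z):
    return max(0.0, z)

def softmax(z):
    m = max(z)                      # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def cross_entropy(y, h):
    # E (for one example) = -sum_k y_k log h_k
    return -sum(yk * math.log(hk) for yk, hk in zip(y, h) if yk > 0)

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(4)]
W1 = [[rng.gauss(0, 0.1) for _ in range(4)] for _ in range(3)]
W2 = [[rng.gauss(0, 0.1) for _ in range(3)] for _ in range(3)]
h = softmax(layer(layer(x, W1, math.tanh), W2, relu))
print(abs(sum(h) - 1.0) < 1e-9)                               # True
print(round(cross_entropy([1, 0, 0], [0.5, 0.25, 0.25]), 4))  # 0.6931
```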
<br />
The identities of the two tissues are then appended to the vector of outputs of the first hidden layer, together forming the input to the second hidden layer. Each identity is a 1-of-5 binary vector in this case (demonstrated in Fig. 1). The first training target contains three classes, labeled ''low'', ''medium'' and ''high'' (LMH code). The second task describes the <math>\Delta PSI</math> between two tissues for a particular exon; the three classes corresponding to this task are ''decreased inclusion'', ''no change'' and ''increased inclusion'' (DNI code). Both the LMH and DNI codes are trained jointly, reusing the same hidden representations learned by the model. The DNN used backpropagation with dropout to train on the data, with different learning rates for the two tasks. <br />
<br />
[[File: Modell.png]]<br />
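The tissue encoding described above amounts to concatenating two 1-of-5 vectors onto the first hidden layer's outputs (a sketch with toy activations):

```python
TISSUES = ("brain", "heart", "kidney", "liver", "testis")

def one_hot(tissue):
    return [1 if t == tissue else 0 for t in TISSUES]

hidden1 = [0.3, -0.1, 0.7]                    # toy first-hidden-layer outputs
x2 = hidden1 + one_hot("heart") + one_hot("testis")
print(one_hot("heart"))  # [0, 1, 0, 0, 0]
print(len(x2))           # 13 = 3 activations + two 5-bit tissue codes
```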
<br />
= Training the model =<br />
<br />
The first hidden layer was trained as an autoencoder to reduce the dimensionality of the features in an unsupervised manner. This method of pretraining the network has been used in deep architectures to initialize learning near a good local minimum. In the second stage of training, the weights from the input layer to the first hidden layer are fixed, and 10 additional inputs corresponding to the two tissues are appended. The vector representation of a tissue is a binary vector; for example, it takes the form [0 1 0 0 0] to denote the second tissue out of five possible types. The weights connecting the remaining hidden layers of the DNN are then trained together in a supervised manner with backpropagation. The DNN weights were initialized with small random values sampled from a zero-mean Gaussian distribution, and a 50% dropout rate was used for all layers except the input layer. <br />
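A sketch of the initialization and dropout just described (the standard deviation is illustrative — the paper only says "small random values" — and this is the inverted-dropout variant, which rescales survivors at training time):

```python
import random

rng = random.Random(0)

def gaussian_init(n_out, n_in, std=0.01):
    # small zero-mean Gaussian weights
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def dropout(activations, p=0.5):
    # drop each unit with probability p, rescale the survivors by 1/(1-p)
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

W = gaussian_init(4, 6)
out = dropout([0.5, -1.2, 0.8, 0.1])
print(len(W), len(W[0]))  # 4 6
```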
<br />
In addition, they filtered the data before training by excluding examples if the total number of RNA-Seq junction reads was below 10. This removed 45.8% of the total number of training examples. <br />
<br />
Both the LMH and DNI codes are trained together, but with separate learning rates, because the two tasks may learn at different speeds. This prevents one task from overfitting too soon and negatively affecting the performance of the other before the complete model is fully trained. <br />
<br />
The targets consist of (i) the PSI for each of the two tissues and (ii) the <math> \Delta PSI </math> between the two tissues. As a result, given the same tissue as both inputs, the model should predict no change for <math> \Delta PSI </math>. Also, if the tissues are swapped in the input, a previous ''increased inclusion'' label should become ''decreased inclusion''.<br />
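These consistency requirements can be written down directly (the thresholding scheme below is illustrative; the labels are the DNI classes above):

```python
def dni_label(delta_psi, eps=1e-6):
    if delta_psi > eps:
        return "increased inclusion"
    if delta_psi < -eps:
        return "decreased inclusion"
    return "no change"

print(dni_label(0.0))    # same tissue twice gives delta PSI = 0 -> "no change"
# swapping the tissue pair negates delta PSI and flips the label:
print(dni_label(0.3))    # increased inclusion
print(dni_label(-0.3))   # decreased inclusion
```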
<br />
= Performance comparison =<br />
<br />
The performance of the models was assessed using the area under the receiver operating characteristic curve (AUC). This paper compared the three methods against the same baseline: DNN, BNN and MLR. <br />
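The AUC metric used here equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one; a minimal rank-based implementation:

```python
def auc(scores, labels):
    # Mann-Whitney formulation: fraction of positive/negative pairs where
    # the positive outscores the negative (ties count half)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0]))  # 1.0  (perfect ranking)
print(auc([0.9, 0.4, 0.8, 0.2], [1, 1, 0, 0]))  # 0.75
```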
<br />
The results (LMH code) are shown in the table below. Table 1a reports AUC for PSI predictions from the LMH code on all tissues, while 1b reports AUC evaluated on the subset of events that exhibit large tissue variability. From 1a, the performance of the DNN in the ''low'' and ''high'' categories is comparable with the BNN, but better at the ''medium'' level. From 1b, the DNN significantly outperformed the BNN and MLR. In both comparisons, MLR performed poorly. <br />
<br />
[[File: LMH.png]]<br />
<br />
Next, we look at how well the different methods can predict <math>\Delta PSI</math> (the DNI code). The DNN predicts the LMH and DNI codes at the same time, while the BNN can only predict the LMH code. Thus, for a fair comparison, the authors trained an MLR on the predicted outputs for each tissue pair from the BNN, and similarly trained an MLR on the LMH outputs of the DNN. Table 2 shows that both DNN and DNN+MLR outperformed BNN+MLR and MLR. <br />
<br />
[[File: DNI.png]]<br />
<br />
<br />
'''Why did DNN outperform?'''<br />
<br />
1. The use of tissue types as an input feature, which stringently required the model's hidden representations to be in a form that can be well modulated by information specifying the different tissue types for splicing pattern prediction. <br />
<br />
2. The model is described by thousands of hidden units and multiple layers of non-linearity. In contrast, BNN only has 30 hidden units, which may not be sufficient. <br />
<br />
3. A hyperparameter search is performed to optimize the DNN.<br />
<br />
4. The use of dropout, which contributed ~1-6% improvement in the LMH code for different tissues, and ~2-7% in the DNI code, compared with without dropout.<br />
<br />
5. Training was biased toward the tissue-specific events (by construction of minibatches).<br />
<br />
= Conclusion =<br />
<br />
This work shows that a DNN can also be applied to a sparse biological dataset. Furthermore, the input features can be analyzed in terms of the predictions of the model to gain some insight into the inferred tissue-regulated splicing code. This architecture can easily be extended to the case of more data from different sources.<br />
<br />
= References =<br />
<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Learning_of_the_tissue-regulated_splicing_code&diff=27036deep Learning of the tissue-regulated splicing code2015-12-01T19:19:08Z<p>Rqiao: /* Training the model */</p>
<hr />
<div>= Introduction =<br />
<br />
Alternative splicing (AS) is a regulated process during gene expression that enables the same gene to give rise to splicing isoforms containing different combinations of exons, which leads to different protein products. Furthermore, AS is often tissue dependent. This paper mainly focuses on applying a Deep Neural Network (DNN) to predict the outcome of splicing, and compares its performance to two previously trained models: a Bayesian Neural Network<ref>https://www.cs.cmu.edu/afs/cs/academic/class/15782-f06/slides/bayesian.pdf</ref> (BNN) and Multinomial Logistic Regression<ref>https://en.wikipedia.org/wiki/Multinomial_logistic_regression</ref> (MLR). <br />
<br />
A major difference the authors introduced in the DNN is that each tissue type is treated as an input; in the previous BNN, each tissue type was considered a different output of the neural network. Moreover, in previous work, the splicing code inferred only the direction of change of the percentage of transcripts with an exon spliced in (PSI). This paper performs absolute PSI prediction for each tissue individually, without averaging across tissues, and also predicts the difference in PSI (<math>\Delta</math>PSI) between pairs of tissues. Unlike a standard single-task deep neural network, this model trains these two prediction tasks simultaneously.<br />
<br />
= Model =<br />
<br />
The dataset consists of 11019 mouse alternative exons profiled from RNA-Seq<ref>https://en.wikipedia.org/wiki/RNA-Seq</ref> Data. Five tissue types are available, including brain, heart, kidney, liver and testis. <br />
<br />
The DNN is fully connected, with multiple layers of non-linearity consisting of hidden units. The mathematical expression of model is below:<br />
<br />
::::::: <math>a_v^l = f\left(\sum_{m=1}^{M^{l-1}}{\theta_{v,m}^{l}a_m^{l-1}}\right)</math> <br />
:::::::where <math>a_v^l</math> is the activation of hidden unit <math>v</math> in layer <math>l</math>, computed from the weighted sum of the outputs of the previous layer, and <math>\theta_{v,m}^{l}</math> are the weights between layers. <br />
<br />
::::::: <math>f_{RELU}(z)=max(0,z)</math><br />
::::::: The RELU unit was used for all hidden units except for the first hidden layer, which uses TANH units.<br />
<br />
::::::: <math>h_k=\frac{exp(\sum_m{\theta_{k,m}^{last}a_m^{last}})}{\sum_{k'}{exp(\sum_{m}{\theta_{k',m}^{last}a_m^{last}})}}</math><br />
::::::: This is the softmax function applied at the last layer. <br />
<br />
The cost function minimized during training is <math>E=-\sum_n\sum_{k=1}^{C}{y_{n,k}\log(h_{n,k})}</math>, where <math>n</math> denotes the training example and <math>k</math> indexes the <math>C</math> classes. <br />
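A minimal NumPy sketch of this forward pass and cost (an illustration, not the authors' code; the layer sizes and random weights are hypothetical):<br />

```python
import numpy as np

def forward(x, weights):
    """Forward pass: tanh on the first hidden layer, ReLU on the
    remaining hidden layers, softmax at the output."""
    a = np.tanh(weights[0] @ x)           # first hidden layer: tanh
    for W in weights[1:-1]:
        a = np.maximum(0.0, W @ a)        # remaining hidden layers: ReLU
    z = weights[-1] @ a                   # last-layer pre-activation
    e = np.exp(z - z.max())               # numerically stable softmax
    return e / e.sum()

def cross_entropy(h, y):
    """E = -sum_k y_k log(h_k) for a single training example."""
    return -np.sum(y * np.log(h + 1e-12))

rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, size=(16, 8)),    # input -> hidden 1
           rng.normal(0, 0.1, size=(16, 16)),   # hidden 1 -> hidden 2
           rng.normal(0, 0.1, size=(3, 16))]    # hidden 2 -> 3 classes
h = forward(rng.normal(size=8), weights)
```

The softmax output is a proper probability vector over the three classes, so the cross-entropy is always non-negative.<br />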
<br />
The identities of the two tissues are then appended to the vector of outputs of the first hidden layer, together forming the input to the second hidden layer. Each identity is a 1-of-5 binary encoding in this case (demonstrated in Fig. 1). The first training task has three target classes, labeled ''low'', ''medium'' and ''high'' (the LMH code). The second task describes the <math>\Delta PSI</math> between two tissues for a particular exon; its three classes are ''decreased inclusion'', ''no change'' and ''increased inclusion'' (the DNI code). Both the LMH and DNI codes are trained jointly, reusing the same hidden representations learned by the model. The DNN is trained by backpropagation with dropout, using different learning rates for the two tasks. <br />
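How the tissue identities enter the network can be sketched as follows (a hypothetical illustration; the function name and the 30-unit first-hidden-layer size are invented for the example):<br />

```python
import numpy as np

TISSUES = ["brain", "heart", "kidney", "liver", "testis"]

def one_hot(tissue):
    """1-of-5 binary encoding of a tissue identity."""
    v = np.zeros(len(TISSUES))
    v[TISSUES.index(tissue)] = 1.0
    return v

def second_layer_input(h1, tissue_a, tissue_b):
    """Append the identities of the two tissues to the first hidden
    layer's output, forming the input to the second hidden layer."""
    return np.concatenate([h1, one_hot(tissue_a), one_hot(tissue_b)])

x = second_layer_input(np.ones(30), "heart", "brain")
```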
<br />
[[File: Modell.png]]<br />
<br />
= Training the model =<br />
<br />
The first hidden layer was trained as an autoencoder to reduce the dimensionality of the features in an unsupervised manner. This pretraining method has been used in deep architectures to initialize learning near a good local minimum. In the second stage of training, the weights from the input layer to the first hidden layer are fixed, and 10 additional inputs corresponding to the two tissues are appended. Each tissue is represented by a binary vector; for example, [0 1 0 0 0] denotes the second of the five possible tissue types. The weights connecting the remaining hidden layers of the DNN are then trained together in a supervised manner with backpropagation. The DNN weights were initialized with small random values sampled from a zero-mean Gaussian distribution, and a 50% dropout rate was applied to all layers except the input layer. <br />
<br />
In addition, the data were filtered before training by excluding examples whose total number of RNA-Seq junction reads was below 10. This removed 45.8% of the training examples. <br />
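The filtering rule can be expressed as a simple predicate (the dictionary layout of the examples here is an assumption for illustration):<br />

```python
def filter_examples(examples, min_reads=10):
    """Exclude exons whose total RNA-Seq junction read count falls
    below the threshold (10 in the paper)."""
    return [e for e in examples if e["junction_reads"] >= min_reads]

data = [{"exon": "A", "junction_reads": 3},
        {"exon": "B", "junction_reads": 42},
        {"exon": "C", "junction_reads": 10}]
kept = filter_examples(data)
```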
<br />
Both the LMH and DNI codes are trained together, but each of the two tasks may learn at a different rate. Using separate learning rates prevents one task from overfitting too soon and degrading the performance of the other before the complete model is fully trained. <br />
<br />
The targets consist of (i) the PSI for each of the two tissues and (ii) the <math>\Delta PSI</math> between the two tissues. Consequently, given the same tissue twice, the model should predict no change for <math>\Delta PSI</math>. Also, if the tissues are swapped in the input, a previous ''increased inclusion'' label should become ''decreased inclusion''.<br />
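These consistency requirements can be made concrete with a toy labeling function (the 0.1 threshold is a hypothetical choice; the paper's actual class boundaries may differ):<br />

```python
def dni_label(delta_psi, threshold=0.1):
    """Map a ΔPSI value to one of the three DNI classes.
    The 0.1 threshold is illustrative only."""
    if delta_psi > threshold:
        return "increased inclusion"
    if delta_psi < -threshold:
        return "decreased inclusion"
    return "no change"

# Identical tissues give ΔPSI = 0, hence "no change";
# swapping the tissue pair negates ΔPSI and flips the label.
```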
<br />
= Performance comparison =<br />
<br />
The performance of the model was assessed using the area under the receiver operating characteristic curve (AUC). The paper compares the three methods, DNN, BNN and MLR, on the same benchmark. <br />
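For reference, AUC can be computed with the rank-based (Mann-Whitney) formula: the probability that a randomly chosen positive example is scored above a randomly chosen negative one (a generic sketch, not the authors' evaluation code):<br />

```python
def auc(scores, labels):
    """AUC via the rank-sum formula: P(score of a random positive
    > score of a random negative), ties counting half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```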
<br />
The results for the LMH code are shown in the table below. Table 1a reports AUC for PSI predictions from the LMH code on all tissues, while 1b reports AUC on the subset of events that exhibit large tissue variability. In 1a, the DNN's performance in the ''low'' and ''high'' categories is comparable to the BNN's, but it outperforms at the ''medium'' level. In 1b, the DNN significantly outperforms the BNN and MLR. In both comparisons, MLR performs poorly. <br />
<br />
[[File: LMH.png]]<br />
<br />
Next, we look at how well the different methods predict <math>\Delta PSI</math> (the DNI code). The DNN predicts the LMH and DNI codes at the same time, while the BNN can only predict the LMH code. Thus, for a fair comparison, the authors trained an MLR on the BNN's predicted outputs for each tissue pair, and similarly trained an MLR on the LMH outputs of the DNN. Table 2 shows that both DNN and DNN+MLR outperform BNN+MLR and MLR alone. <br />
<br />
[[File: DNI.png]]<br />
<br />
<br />
'''Why did DNN outperform?'''<br />
<br />
1. The use of tissue types as an input feature, which stringently requires the model's hidden representations to be in a form that can be well modulated by information specifying the different tissue types for splicing pattern prediction. <br />
<br />
2. The model has thousands of hidden units and multiple layers of non-linearity. In contrast, the BNN has only 30 hidden units, which may not be sufficient. <br />
<br />
3. A hyperparameter search was performed to optimize the DNN.<br />
<br />
4. The use of dropout, which contributed a ~1-6% improvement in the LMH code for different tissues, and ~2-7% in the DNI code, compared with training without dropout.<br />
<br />
5. Training was biased toward the tissue-specific events (by construction of the minibatches).<br />
<br />
= Conclusion =<br />
<br />
This work shows that DNNs can also be used on sparse biological datasets. Furthermore, the input features can be analyzed in terms of the model's predictions to gain insight into the inferred tissue-regulated splicing code. The architecture can easily be extended to incorporate more data from different sources.<br />
<br />
= References =<br />
<br />
<references /></div>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks, which contain large numbers of parameters. The key idea is to randomly drop units from the neural network during training; in effect, training samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p, independent of the other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z}^{(l)} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
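The feed-forward equations above can be sketched directly in NumPy (an illustration with ReLU as the activation <math>f</math>; the layer sizes are arbitrary):<br />

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout_forward(y_prev, W, b, p=0.5):
    """One feed-forward step with dropout, following the equations above:
    r ~ Bernoulli(p) masks the previous layer's outputs element-wise."""
    r = rng.binomial(1, p, size=y_prev.shape)   # r_j ~ Bernoulli(p)
    y_tilde = r * y_prev                        # element-wise product
    z = W @ y_tilde + b                         # pre-activation
    return np.maximum(0.0, z), y_tilde          # f = ReLU, for example

y_out, y_tilde = dropout_forward(np.ones(1000),
                                 rng.normal(0, 0.05, size=(50, 1000)),
                                 np.zeros(50))
```

With p = 0.5, roughly half of the 1000 inputs survive the mask on any given pass.<br />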
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
Dropout neural networks can be trained using stochastic gradient descent in a manner similar to standard neural networks. The only difference is that we backpropagate through each thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch, and any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. The learning rate should be a smaller one to retain the information in the pretrained weights.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over just using dropout. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a huge learning rate without the weights blowing up. <br />
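A minimal sketch of the max-norm projection applied after a weight update (the value <math>c=3</math> is a hypothetical choice):<br />

```python
import numpy as np

def max_norm(W, c=3.0):
    """Rescale any row (a unit's incoming weight vector) whose L2 norm
    exceeds c back onto the ball ||w||_2 <= c; shorter rows are untouched."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.minimum(1.0, c / np.maximum(norms, 1e-12))

W = np.array([[3.0, 4.0],    # norm 5.0  -> rescaled to norm 3.0
              [0.1, 0.2]])   # norm ~0.22 -> left unchanged
W_proj = max_norm(W, c=3.0)
```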
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has <math>n</math> units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average their predictions. Thus, at test time, the idea is to use a single neural net without dropout whose weights are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
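This test-time rescaling is a one-liner (a sketch with made-up weight shapes):<br />

```python
import numpy as np

def scale_for_test(weights, p=0.5):
    """Multiply each unit's outgoing weights by the retention probability p,
    so a single test-time network approximates the average over all
    2^n thinned networks."""
    return [p * W for W in weights]

train_weights = [np.ones((4, 3)), np.full((2, 4), 2.0)]
test_weights = scale_for_test(train_weights, p=0.5)
```

As an aside, many modern implementations instead use "inverted dropout", dividing the retained activations by <math>p</math> during training, so the trained weights can be used unchanged at test time.<br />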
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout uses Bernoulli distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>, which works just as well as, or perhaps better than, Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which equals <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. This can be generalized to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2</math> is a hyperparameter to tune.<br />
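The Gaussian variant can be sketched as follows (illustrative only; by construction the perturbed activations have mean <math>h_i</math> and the noise has standard deviation <math>\sigma</math>):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_dropout(h, sigma=1.0):
    """Perturb each activation to h * r' with r' ~ N(1, sigma^2),
    the multiplicative Gaussian alternative to Bernoulli masking."""
    return h * rng.normal(1.0, sigma, size=h.shape)

noisy = gaussian_dropout(np.ones(100_000), sigma=1.0)
```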
<br />
== Applying dropout to linear regression ==<br />
<br />
Let <math>X \in \mathbb{R}^{N\times D}</math> be a data matrix of N data points and <math>\mathbf{y}\in \mathbb{R}^N</math> be a vector of targets. Linear regression tries to find a <math>\mathbf{w}\in \mathbb{R}^D</math> that minimizes <math>\parallel \mathbf{y}-X\mathbf{w}\parallel^2</math>.<br />
<br />
When the input <math>X</math> is dropped out such that any input dimension is retained with probability <math>p</math>, the input can be expressed as <math>R*X</math> where <math>R\in \{0,1\}^{N\times D}</math> is a random matrix with <math>R_{ij}\sim Bernoulli(p)</math> and <math>*</math> denotes element-wise product. Marginalizing the noise, the objective function becomes<br />
<br />
<math>\min_{\mathbf{w}} \mathbb{E}_{R\sim Bernoulli(p)}[\parallel \mathbf{y}-(R*X)\mathbf{w}\parallel^2 ]<br />
</math><br />
<br />
This reduces to <br />
<br />
<math>\min_{\mathbf{w}} \parallel \mathbf{y}-pX\mathbf{w}\parallel^2+p(1-p)\parallel \Gamma\mathbf{w}\parallel^2<br />
</math><br />
<br />
where <math>\Gamma=(diag(X^TX))^{\frac{1}{2}}</math>. Therefore, dropout with linear regression is equivalent to ridge regression with a particular form for <math>\Gamma</math>. This form of <math>\Gamma</math> essentially scales the weight cost for weight <math>w_i</math> by the standard deviation of the <math>i</math>th dimension of the data. If a particular data dimension varies a lot, the regularizer tries to squeeze its weight more.<br />
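The equivalence can be checked numerically: averaging the dropped-out objective over many sampled masks <math>R</math> should match the closed ridge form (synthetic data; the tolerance allows for Monte Carlo error):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, p = 5, 3, 0.5
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

# Closed form: ||y - p X w||^2 + p(1-p) ||Gamma w||^2, Gamma^2 = diag(X^T X)
gamma_sq = np.diag(X.T @ X)
closed = np.sum((y - p * X @ w) ** 2) + p * (1 - p) * np.sum(gamma_sq * w ** 2)

# Monte Carlo estimate of E_R ||y - (R * X) w||^2 over Bernoulli masks R
S = 100_000
R = rng.binomial(1, p, size=(S, N, D))
preds = (R * X) @ w                           # shape (S, N), one row per mask
mc = np.mean(np.sum((y - preds) ** 2, axis=1))
```

With enough samples the Monte Carlo average converges to the closed form, confirming that marginalizing the dropout noise yields the ridge-style penalty.<br />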
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
In a standard neural network, units may change in a way that fixes up the mistakes of other units, which can lead to complex co-adaptations and overfitting, because these co-adaptations do not generalize to unseen data. Dropout breaks the co-adaptations between hidden units by making the presence of other units unreliable. Figure 7a shows that each hidden unit has co-adapted in order to produce good reconstructions; each hidden unit on its own does not detect a meaningful feature. In Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should be only a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b than in Figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to determine a good value of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant (fixed <math>n</math>).<br />
2. The expected number of hidden units retained after dropout is held constant (fixed <math>pn </math>).<br />
The optimal <math>p </math> in case 1 lies between 0.4 and 0.8, while in case 2 it is 0.6. The usual default value in practice is 0.5, which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. From Figure 10, dropout gives no improvement on small data sets (100, 500 examples). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors performed dropout on the MNIST data set and compared different methods. The MNIST data set consists of 28 x 28 pixel handwritten digit images, and the task is to classify the images into 10 digit classes. From the result table, the Deep Boltzmann Machine + dropout finetuning performs best, with only a 0.79% error rate. <br />
<br />
[[File:Result.png]]<br />
<br />
In order to test the robustness of dropout, they did classification experiments with networks of many different architectures keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these different architectures as training progresses. Dropout gives a huge improvement across all architectures.<br />
<br />
[[File:dropout.PNG]]<br />
<br />
The authors also applied the dropout scheme to many neural networks and tested them on different datasets, such as the Street View House Numbers (SVHN), CIFAR, ImageNet and TIMIT datasets. Adding dropout consistently reduced the error rate and further improved the performance of the neural networks.<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique for preventing overfitting in deep neural networks with large numbers of parameters. It can also be extended to Restricted Boltzmann Machines and other graphical models, e.g. convolutional networks. One drawback of dropout is that it increases training time, creating a trade-off between overfitting and training time.<br />
<br />
=Reference=<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26602dropout2015-11-19T03:32:48Z<p>Rqiao: /* Applying dropout to linear regression */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is one of the techniques for preventing overfitting in deep neural network which contains a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” network. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retrained with probability p independent of other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layer. Let <math>\bold{z^{(l)}} </math> denote the vector inputs into layer <math> l </math>, <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
Dropout neural network can be trained using stochastic gradient descent in a manner similar to standard neural network. The only difference here is that we only back propagate on each thinned network. The gradient for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. The learning rate should be a smaller one to retain the information in the pretrained weights.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over just using dropout. Max-norm constrain the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we put constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constrain is that it makes the model possible to use huge learning rate without the possibility of weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units, there will be <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weight of this network are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. Figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout takes Bernoulli distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>. It works just as well, or perhaps better than using Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which equals to <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. We can generalize this to <math>r' \sim \mathcal{N}(1, \sigma^2) </math> which <math>\sigma^2</math> is a hyperparameter to tune.<br />
<br />
== Applying dropout to linear regression ==<br />
<br />
Let <math>X \in \mathbb{R}^{N\times D}</math> be a data matrix of N data points. <math>\mathbf{y}\in \mathbb{R}^N</math> be a vector of targets.Linear regression tries to find a <math>\mathbf{w}\in mathbb{R}^D</math> that maximizes <math>\parallel \mathbf{y}-X\mathbf{w}\parallel^2</math>.<br />
<br />
When the input <math>X</math> is dropped out such that any input dimension is retained with probability <math>p</math>, the input can be expressed as <math>R*X</math> where <math>R\in \{0,1\}^{N\times D}</math> is a random matrix with <math>R_{ij}\sim Bernoulli(p)</math> and <math>*</math> denotes element-wise product. Marginalizing the noise, the objective function becomes<br />
<br />
<math>\min_{\mathbf{w}} \mathbb{E}_{R\sim Bernoulli(p)}[\parallel \mathbf{y}-(R*X)\mathbf{w}\parallel^2 ]<br />
</math><br />
<br />
This reduce to <br />
<br />
<math>\min_{\mathbf{w}} \parallel \mathbf{y}-pX\mathbf{w}\parallel^2+p(1-p)\parallel \Gamma\mathbf{w}\parallel^2<br />
</math><br />
<br />
where <math>\Gamma=(diag(X^TX))^{\frac{1}{2}}</math>. Therefore, dropout with linear regression is equivalent to ridge regression with a particular form for <math>\Gamma</math>. This form of <math>\Gamma</math> essentially scales the weight cost for weight <math>w_i</math> by the standard deviation of the <math>i</math>th dimension of the data. If a particular data dimension varies a lot, the regularizer tries to squeeze its weight more.<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
In a standard neural network, units may change in a way that they fix up the mistakes of the other units, which may lead to complex co-adaptations and overfitting because these co-adaptations do not generalize to unseen data. Dropout breaks the co-adaptations between hidden units by making the presence of other units unreliable. Firgure 7a shows that each hidden unit has co-adapted in order to produce good reconstructions. Each hidden unit its own does not detect meaningful feature. In figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps preventing overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations we can see that fewer hidden units have high activations in Figure 8b compared to figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper tested to determine the tunable hyperparameter <math>p </math>. The comparison is down in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 is between (0.4, 0.8 ); while the one in case 2 is 0.6. The usual default value in practice is 0.5 which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. From Figure 10, dropout does not give any improvement on small data sets (100, 500 examples). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors applied dropout to the MNIST data and compared different methods. The MNIST data set consists of 28 x 28 pixel handwritten digit images, and the task is to classify the images into 10 digit classes. From the result table, a Deep Boltzmann Machine with dropout finetuning performs best, with only a 0.79% error rate. <br />
<br />
[[File:Result.png]]<br />
<br />
In order to test the robustness of dropout, they did classification experiments with networks of many different architectures keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these different architectures as training progresses. Dropout gives a huge improvement across all architectures.<br />
<br />
[[File:dropout.PNG]]<br />
<br />
The authors also applied the dropout scheme to many neural networks and tested on different data sets, such as Street View House Numbers (SVHN), CIFAR, ImageNet and the TIMIT data set. Adding dropout consistently reduced the error rate and further improved the performance of the neural networks.<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique to prevent overfitting in deep neural networks, which have a large number of parameters. It can also be extended to Restricted Boltzmann Machines and other graphical models (e.g., convolutional networks). One drawback of dropout is that it increases training time, creating a trade-off between overfitting and training time.<br />
<br />
=Reference=<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26601dropout2015-11-19T03:29:09Z<p>Rqiao: /* Applying dropout to linear regression */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p independently of the other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z^{(l)}} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
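The feed-forward step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the layer sizes, the choice of ReLU for <math>f</math>, and the seed are illustrative assumptions.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(y_l, W, b, p=0.5, f=lambda z: np.maximum(z, 0.0)):
    r = (rng.random(y_l.shape) < p).astype(float)  # r_j ~ Bernoulli(p)
    y_thin = r * y_l                               # y~ = r * y (element-wise)
    z = W @ y_thin + b                             # z^{(l+1)} = W y~ + b
    return f(z)                                    # y^{(l+1)} = f(z^{(l+1)})

y_l = np.ones(8)                # outputs of layer l
W = rng.normal(size=(4, 8))     # weights into layer l+1
b = np.zeros(4)                 # biases of layer l+1
y_next = dropout_forward(y_l, W, b, p=0.5)
```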
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that back-propagation is performed only through each thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. The learning rate should be a smaller one to retain the information in the pretrained weights.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over using dropout alone. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be bounded above by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible for the model to use a large learning rate without the weights blowing up. <br />
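The max-norm constraint is typically enforced by projecting the weights back inside the ball of radius <math>c</math> after each gradient update. A hedged sketch follows; the value of <math>c</math> and the weight shapes are illustrative assumptions.<br />

```python
import numpy as np

# Max-norm projection: after each gradient step, rescale any hidden unit's
# incoming weight vector whose L2 norm exceeds c. Shapes and c are illustrative.
def max_norm_project(W, c=3.0):
    # W: (n_units, n_inputs); row i holds the weights incident on unit i
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))  # shrink only if over c
    return W * scale

rng = np.random.default_rng(2)
W = rng.normal(scale=5.0, size=(6, 20))
W_proj = max_norm_project(W, c=3.0)
assert np.all(np.linalg.norm(W_proj, axis=1) <= 3.0 + 1e-9)
```

In practice this projection would be applied after every SGD update, so the weights always satisfy the constraint.<br />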
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average their predictions. Thus, at test time, the idea is to use a single neural net without dropout whose weights are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
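For a single linear unit, this rescaling reproduces the expected output under random dropout masks exactly, since <math>\mathbb{E}[r_j]=p</math>. A small numerical check (sizes, seed and sample count are illustrative assumptions):<br />

```python
import numpy as np

# Compare the mask-averaged output of one linear unit against the single
# "test-time" net whose weights are scaled by p.
rng = np.random.default_rng(1)
p = 0.5
w = rng.normal(size=10)   # trained weights of one linear unit
x = rng.normal(size=10)   # a fixed input

# Average the dropped-out output over many sampled Bernoulli masks
samples = [((rng.random(10) < p) * w) @ x for _ in range(50000)]
mc_mean = np.mean(samples)

test_time = (p * w) @ x   # single net with weights scaled by p
assert abs(mc_mean - test_time) < 0.05  # equal up to Monte Carlo error
```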
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout takes Bernoulli distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>. It works just as well, or perhaps better than, using Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which equals <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. We can generalize this to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2</math> is a hyperparameter to tune.<br />
<br />
== Applying dropout to linear regression ==<br />
<br />
Let <math>X \in \mathbb{R}^{N\times D}</math> be a data matrix of N data points and <math>\mathbf{y}\in \mathbb{R}^N</math> be a vector of targets. Linear regression tries to find a <math>\mathbf{w}\in \mathbb{R}^D</math> that minimizes <math>\parallel \mathbf{y}-X\mathbf{w}\parallel^2</math>.<br />
<br />
When the input <math>X</math> is dropped out such that any input dimension is retained with probability <math>p</math>, the input can be expressed as <math>R*X</math> where <math>R\in \{0,1\}^{N\times D}</math> is a random matrix with <math>R_{ij}\sim Bernoulli(p)</math> and <math>*</math> denotes element-wise product. Marginalizing the noise, the objective function becomes<br />
<br />
<math>\min_{\mathbf{w}} \mathbb{E}_{R\sim Bernoulli(p)}[\parallel \mathbf{y}-(R*X)\mathbf{w}\parallel^2 ]<br />
</math><br />
<br />
This reduces to <br />
<br />
<math>\min_{\mathbf{w}} \parallel \mathbf{y}-pX\mathbf{w}\parallel^2+p(1-p)\parallel \Gamma\mathbf{w}\parallel^2<br />
</math><br />
<br />
where <math>\Gamma=(diag(X^TX))^{\frac{1}{2}}</math><br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
In a standard neural network, units may change in a way that fixes up the mistakes of the other units, which can lead to complex co-adaptations and overfitting because these co-adaptations do not generalize to unseen data. Dropout breaks these co-adaptations between hidden units by making the presence of other units unreliable. Figure 7a shows that each hidden unit has co-adapted in order to produce good reconstructions; no hidden unit on its own detects a meaningful feature. In Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b compared to Figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to determine the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 lies in the range (0.4, 0.8), while in case 2 it is about 0.6. The usual default value in practice is 0.5, which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. From Figure 10, dropout does not give any improvement on small data sets (100, 500 examples). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors applied dropout to the MNIST data and compared different methods. The MNIST data set consists of 28 x 28 pixel handwritten digit images, and the task is to classify the images into 10 digit classes. From the result table, a Deep Boltzmann Machine with dropout finetuning performs best, with only a 0.79% error rate. <br />
<br />
[[File:Result.png]]<br />
<br />
In order to test the robustness of dropout, they did classification experiments with networks of many different architectures keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these different architectures as training progresses. Dropout gives a huge improvement across all architectures.<br />
<br />
[[File:dropout.PNG]]<br />
<br />
The authors also applied the dropout scheme to many neural networks and tested on different data sets, such as Street View House Numbers (SVHN), CIFAR, ImageNet and the TIMIT data set. Adding dropout consistently reduced the error rate and further improved the performance of the neural networks.<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique to prevent overfitting in deep neural networks, which have a large number of parameters. It can also be extended to Restricted Boltzmann Machines and other graphical models (e.g., convolutional networks). One drawback of dropout is that it increases training time, creating a trade-off between overfitting and training time.<br />
<br />
=Reference=<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26600dropout2015-11-19T03:24:14Z<p>Rqiao: /* Model */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p independently of the other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z^{(l)}} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that back-propagation is performed only through each thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. The learning rate should be a smaller one to retain the information in the pretrained weights.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over using dropout alone. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be bounded above by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible for the model to use a large learning rate without the weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average their predictions. Thus, at test time, the idea is to use a single neural net without dropout whose weights are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout takes Bernoulli distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>. It works just as well, or perhaps better than, using Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which equals <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. We can generalize this to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2</math> is a hyperparameter to tune.<br />
<br />
== Applying dropout to linear regression ==<br />
<br />
Let <math>X \in \mathbb{R}^{N\times D}</math> be a data matrix of N data points and <math>\mathbf{y}\in \mathbb{R}^N</math> be a vector of targets. Linear regression tries to find a <math>\mathbf{w}\in \mathbb{R}^D</math> that minimizes <math>\parallel \mathbf{y}-X\mathbf{w}\parallel^2</math>.<br />
<br />
When the input <math>X</math> is dropped out such that any input dimension is retained with probability <math>p</math>, the input can be expressed as <math>R*X</math> where <math>R\in \{0,1\}^{N\times D}</math> is a random matrix with <math>R_{ij}\sim Bernoulli(p)</math> and <math>*</math> denotes element-wise product. Marginalizing the noise, the objective function becomes<br />
<br />
<math>\min_{\mathbf{w}} \mathbb{E}_{R\sim Bernoulli(p)}[\parallel \mathbf{y}-(R*X)\mathbf{w}\parallel^2 ]</math><br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
In a standard neural network, units may change in a way that fixes up the mistakes of the other units, which can lead to complex co-adaptations and overfitting because these co-adaptations do not generalize to unseen data. Dropout breaks these co-adaptations between hidden units by making the presence of other units unreliable. Figure 7a shows that each hidden unit has co-adapted in order to produce good reconstructions; no hidden unit on its own detects a meaningful feature. In Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b compared to Figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to determine the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 lies in the range (0.4, 0.8), while in case 2 it is about 0.6. The usual default value in practice is 0.5, which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. From Figure 10, dropout does not give any improvement on small data sets (100, 500 examples). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors applied dropout to the MNIST data and compared different methods. The MNIST data set consists of 28 x 28 pixel handwritten digit images, and the task is to classify the images into 10 digit classes. From the result table, a Deep Boltzmann Machine with dropout finetuning performs best, with only a 0.79% error rate. <br />
<br />
[[File:Result.png]]<br />
<br />
In order to test the robustness of dropout, they did classification experiments with networks of many different architectures keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these different architectures as training progresses. Dropout gives a huge improvement across all architectures.<br />
<br />
[[File:dropout.PNG]]<br />
<br />
The authors also applied the dropout scheme to many neural networks and tested on different data sets, such as Street View House Numbers (SVHN), CIFAR, ImageNet and the TIMIT data set. Adding dropout consistently reduced the error rate and further improved the performance of the neural networks.<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique to prevent overfitting in deep neural networks, which have a large number of parameters. It can also be extended to Restricted Boltzmann Machines and other graphical models (e.g., convolutional networks). One drawback of dropout is that it increases training time, creating a trade-off between overfitting and training time.<br />
<br />
=Reference=<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f15Stat946PaperSignUp&diff=26585f15Stat946PaperSignUp2015-11-19T02:58:37Z<p>Rqiao: /* Set B */</p>
<hr />
<div> <br />
=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=<br />
<br />
= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=<br />
<br />
Use the following notations:<br />
<br />
S: You have written a summary on the paper<br />
<br />
T: You had technical contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
E: You had editorial contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]<br />
<br />
<br />
=Set A=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Oct 16 || pascal poupart || || Guest Lecturer||||<br />
|-<br />
|Oct 16 ||pascal poupart || ||Guest Lecturer ||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]<br />
|-<br />
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]<br />
|-<br />
|Makeup Class (TBA) || Peter Blouw|| ||Memory Networks.|| [http://arxiv.org/abs/1410.3916 Paper]|| [[Memory Networks|Summary]]<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]<br />
|-<br />
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]<br />
|-<br />
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]<br />
|-<br />
|Nov 13 || Tim Tse || || Question Answering with Subgraph Embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] || [[Question Answering with Subgraph Embeddings | Summary ]]<br />
|-<br />
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]<br />
|-<br />
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]<br />
|-<br />
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]|| [[Natural language processing (almost) from scratch. | Summary]]<br />
|-<br />
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]<br />
|-<br />
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | Summary]]<br />
|-<br />
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] || [[Genetics | Summary]]<br />
|-<br />
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/pdf/10.1021/ci500747n paper]||<br />
|-<br />
|Nov 27 || Derek Latremouille || ||Learning Fast Approximations of Sparse Coding || [http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]<br />
|-<br />
|Nov 27 ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models||||<br />
|-<br />
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]<br />
|-<br />
|Dec 4 || Fatemeh Karimi || ||MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||<br />
|-<br />
|Dec 4 || Jan Gosmann || || On the Number of Linear Regions of Deep Neural Networks || [http://arxiv.org/abs/1402.1869 Paper] || [[On the Number of Linear Regions of Deep Neural Networks | Summary]]<br />
|-<br />
|Dec 4 || Dylan Drover || 54 || Semi-supervised Learning with Deep Generative Models || [http://papers.nips.cc/paper/5352-semi-supervised-learning-with-deep-generative-models.pdf Paper] || [[Semi-supervised Learning with Deep Generative Models | Summary]]<br />
|-<br />
|}<br />
|}<br />
<br />
=Set B=<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Anthony Caterini ||15 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]|| [[The Manifold Tangent Classifier|Summary]]<br />
|-<br />
|Jan Gosmann || || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]<br />
|-<br />
|Brent Komer || || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] || [[Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers|Summary]]<br />
|-<br />
|Sean Aubin || || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]<br />
|-<br />
|Peter Blouw|| || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] || [[Generating text with recurrent neural networks|Summary]]<br />
|-<br />
|Tim Tse|| || From Machine Learning to Machine Reasoning || [http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]<br />
|-<br />
|Rui Qiao|| 40 || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]<br />
|-<br />
|Fatemeh Karimi|| 23 || Very Deep Convolutional Networks for Large-Scale Image Recognition || [http://arxiv.org/pdf/1409.1556.pdf Paper] || [[Very Deep Convoloutional Networks for Large-Scale Image Recognition|Summary]]<br />
|-<br />
|Amirreza Lashkari|| 43 || Distributed Representations of Words and Phrases and their Compositionality || [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Paper] || [[Distributed Representations of Words and Phrases and their Compositionality|Summary]]<br />
|-<br />
|Xinran Liu|| 19 || Joint training of a convolutional network and a graphical model for human pose estimation || [http://papers.nips.cc/paper/5573-joint-training-of-a-convolutional-network-and-a-graphical-model-for-human-pose-estimation.pdf Paper] || [[Joint training of a convolutional network and a graphical model for human pose estimation|Summary]]<br />
|-<br />
|Chris Choi|| || Learning Long-Range Vision for Autonomous Off-Road Driving || [http://yann.lecun.com/exdb/publis/pdf/hadsell-jfr-09.pdf Paper] || [[Learning Long-Range Vision for Autonomous Off-Road Driving|Summary]]<br />
|-<br />
|Luyao Ruan|| || Deep Learning of the tissue-regulated splicing code || [http://bioinformatics.oxfordjournals.org/content/30/12/i121.full.pdf+html Paper] || [[Deep Learning of the tissue-regulated splicing code| Summary]]<br />
|-<br />
|Abdullah Rashwan|| || Deep Convolutional Neural Networks For LVCSR || [http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf paper] || [[Deep Convolutional Neural Networks For LVCSR| Summary]]<br />
|-<br />
|Mahmood Gohari||37 || On using very large target vocabulary for neural machine translation || [http://arxiv.org/pdf/1412.2007v2.pdf paper] || [[On using very large target vocabulary for neural machine translation| Summary]]<br />
|-<br />
|Valerie Platsko|| || Learning Convolutional Feature Hierarchies for Visual Recognition || [http://papers.nips.cc/paper/4133-learning-convolutional-feature-hierarchies-for-visual-recognition Paper] || [[Learning Convolutional Feature Hierarchies for Visual Recognition | Summary]]<br />
|-<br />
|Derek Latremouille|| || The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] || [[The Wake-Sleep Algorithm for Unsupervised Neural Networks | Summary]]<br />
|-<br />
|Ri Wang|| || Continuous space language models || [https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester2_2009_10/sdarticle.pdf Paper] || [[Continuous space language models | Summary]]<br />
|-<br />
|Deepak Rishi|| || Extracting and Composing Robust Features with Denoising Autoencoders || [http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf Paper] || [[Extracting and Composing Robust Features with Denoising Autoencoders | Summary]]<br />
|-<br />
|Maysum Panju|| || A fast learning algorithm for deep belief nets || [https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf Paper] || [[A fast learning algorithm for deep belief nets | Summary]]<br />
|-<br />
|Dylan Drover|| 53 || Deep Generative Stochastic Networks Trainable by Backprop || [http://jmlr.org/proceedings/papers/v32/bengio14.pdf Paper] || [[Deep Generative Stochastic Networks Trainable by Backprop| Summary]]</div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=26397learning Phrase Representations2015-11-17T16:32:41Z<p>Rqiao: /* References */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence <math>\mathbf{x}</math> sequentially. As it reads each symbol, the hidden state of the RNN is updated according to<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by<br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (input sequence, output sequence) pair from the training set. Since the output of the decoder is a differentiable function of the model parameters, the parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways: to generate a target sequence given an input sequence, or to score a given pair of input and output sequences.<br />
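To make the encoder–decoder recurrences above concrete, the following NumPy sketch runs a toy forward pass: the encoder folds an input sequence into a summary vector <math>\mathbf{c}</math>, and each decoder step conditions on the previous hidden state, the previous symbol, and <math>\mathbf{c}</math>. The dimensions, the choice of <math>\tanh</math> for <math>f</math>, and a softmax readout for <math>g</math> are illustrative assumptions, not the paper's exact parameterization.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 8, 5  # illustrative sizes

# Encoder parameters (random initialization, for the sketch only)
W, U = rng.normal(size=(d_hid, d_in)), rng.normal(size=(d_hid, d_hid))
# Decoder parameters; C injects the summary c at every step
Wd = rng.normal(size=(d_hid, d_out))
Ud, C = rng.normal(size=(d_hid, d_hid)), rng.normal(size=(d_hid, d_hid))
V = rng.normal(size=(d_out, d_hid))  # readout to symbol scores

def encode(xs):
    """h_t = f(h_{t-1}, x_t); the final hidden state is the summary c."""
    h = np.zeros(d_hid)
    for x in xs:
        h = np.tanh(W @ x + U @ h)
    return h

def decode_step(h, y_prev, c):
    """h_t = f(h_{t-1}, y_{t-1}, c) and p(y_t | ...) = g(h_t, y_{t-1}, c)."""
    h = np.tanh(Wd @ y_prev + Ud @ h + C @ c)
    scores = V @ h
    p = np.exp(scores - scores.max())  # softmax over output symbols
    return h, p / p.sum()

c = encode([rng.normal(size=d_in) for _ in range(6)])  # T = 6 input symbols
h, y_prev = c, np.zeros(d_out)                         # initialize decoder from c
h, p = decode_step(h, y_prev, c)
print(p.shape, round(float(p.sum()), 6))               # (5,) 1.0
```

Note that the input length (6) and output length need not match, which is exactly the <math>T \neq T'</math> flexibility described above.<br />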
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
The paper also proposes a new type of hidden unit that is inspired by the LSTM unit but is much simpler to compute and implement. Fig. 2 shows a graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, the unit is defined as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[\cdot]_j</math> denotes the j-th element of a vector, and <math>\odot</math> denotes element-wise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is later found to be irrelevant, thus allowing a more compact representation. The update gate, on the other hand, controls how much information from the previous hidden state carries over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
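As a sketch, the four update equations above can be implemented directly in NumPy. Taking <math>\phi=\tanh</math> and using small random parameter matrices are assumptions for illustration; the matrix shapes and gate order otherwise follow the equations.<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_unit_step(x, h_prev, Wr, Ur, Wz, Uz, W, U):
    """One step of the proposed gated hidden unit (today known as the GRU)."""
    r = sigmoid(Wr @ x + Ur @ h_prev)            # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev)            # update gate
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))  # candidate state, phi = tanh
    return z * h_prev + (1.0 - z) * h_tilde      # element-wise interpolation

rng = np.random.default_rng(1)
d_in, d_hid = 3, 4
# Wr, Ur, Wz, Uz, W, U with alternating (d_hid, d_in) / (d_hid, d_hid) shapes
params = [rng.normal(size=(d_hid, s)) for s in (d_in, d_hid) * 3]
h = np.zeros(d_hid)
for _ in range(5):
    h = gated_unit_step(rng.normal(size=d_in), h, *params)
print(h.shape)  # (4,)
```

Because the new state is a convex combination of the previous state and a <math>\tanh</math> candidate, each component of <math>h</math> stays in <math>(-1, 1)</math>; setting <math>z \approx 1</math> simply copies the previous state forward, which is how the unit carries long-term information.<br />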
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation (SMT) system, the goal of the system (the decoder, specifically) is to find a translation <math>\mathbf{f}</math> given a source sentence <math>\mathbf{e}</math> that maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term on the right-hand side is called the translation model and the second the language model (see, e.g., Koehn, 2005). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a log-linear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
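As a toy illustration of this log-linear ranking (all feature values and weights below are hypothetical), note that <math>\log Z(\mathbf{e})</math> is constant across candidate translations of a fixed source sentence, so it can be dropped when comparing candidates:<br />

```python
def loglinear_score(features, weights):
    """Unnormalized log p(f|e): a weighted sum of feature functions.
    log Z(e) does not depend on the candidate, so it is omitted."""
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical feature values for two candidate translations of one source:
# [log phrase-translation prob, log language-model prob, RNN enc-dec score]
cand_a = [-2.1, -5.3, -1.2]
cand_b = [-1.8, -6.0, -3.5]
weights = [1.0, 0.8, 0.5]  # in practice tuned to maximize BLEU on a dev set

best = max([cand_a, cand_b], key=lambda f: loglinear_score(f, weights))
print(best is cand_a)  # True
```

Adding the RNN Encoder–Decoder's phrase score is then just one more entry in the feature vector, with its own tuned weight.<br />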
<br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and to use its scores as additional features in the log-linear model shown above when tuning the SMT decoder.<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model, Cho et al. used a baseline phrase-based SMT system and a continuous-space neural language model (CSLM).<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
1. Baseline configuration<br />
2. Baseline + RNN<br />
3. Baseline + CSLM + RNN<br />
4. Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not strongly correlated, and that one can expect better results by improving each method independently.<br />
<br />
= References=<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=26396learning Phrase Representations2015-11-17T16:32:13Z<p>Rqiao: </p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state<math> h_t</math> . However, as shown in figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by, <br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. On the other hand, the model can be used to score a given pair of input and output sequences.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also propose a new type of hidden node that has been inspired by LSTM but is much simpler to compute and implement. Fig. 2 shows the graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically it can be shown as(<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector and <math>\odot</math> means elementwise multiply):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember longterm information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term at the right hand side is called translation model and the latter language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a loglinear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
<br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the loglinear model showed above when tuning the SMT decoder.<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model Cho et al. used baseline phrase-based SMT system and a Neural Language Model(CSLM)<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
1. Baseline configuration<br />
2. Baseline + RNN<br />
3. Baseline + CSLM + RNN<br />
4. Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, where we penalizes the number of unknown words to neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when we used both CSLM and the phrase scores from the RNN Encoder–Decoder. This suggests that the contributions of the CSLM and the RNN Encoder– Decoder are not too correlated and that one can expect better results by improving each method independently<br />
<br />
= References=<br />
</references></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=26395learning Phrase Representations2015-11-17T16:31:41Z<p>Rqiao: /* Scoring Phrase Pairs with RNN Encoder–Decoder */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state<math> h_t</math> . However, as shown in figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by, <br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. On the other hand, the model can be used to score a given pair of input and output sequences.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also propose a new type of hidden node that has been inspired by LSTM but is much simpler to compute and implement. Fig. 2 shows the graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically it can be shown as(<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector and <math>\odot</math> means elementwise multiply):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember longterm information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term at the right hand side is called translation model and the latter language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a loglinear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
<br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the loglinear model showed above when tuning the SMT decoder.<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model Cho et al. used baseline phrase-based SMT system and a Neural Language Model(CSLM)<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
1. Baseline configuration<br />
2. Baseline + RNN<br />
3. Baseline + CSLM + RNN<br />
4. Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, where we penalizes the number of unknown words to neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when we used both CSLM and the phrase scores from the RNN Encoder–Decoder. This suggests that the contributions of the CSLM and the RNN Encoder– Decoder are not too correlated and that one can expect better results by improving each method independently</div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=26394learning Phrase Representations2015-11-17T16:29:30Z<p>Rqiao: /* Scoring Phrase Pairs with RNN Encoder–Decoder */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state<math> h_t</math> . However, as shown in figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by, <br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. On the other hand, the model can be used to score a given pair of input and output sequences.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also propose a new type of hidden node that has been inspired by LSTM but is much simpler to compute and implement. Fig. 2 shows the graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically it can be shown as(<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector and <math>\odot</math> means elementwise multiply):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember longterm information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
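The three update equations above can be sketched in a few lines of NumPy (an illustrative implementation, not the authors' code; bias terms are omitted as in the equations, <math>\phi</math> is taken to be tanh, and the shapes and random initialization are invented for the example):<br />

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One step of the gated hidden unit described above.
    params holds the six weight matrices W_r, U_r, W_z, U_z, W, U."""
    W_r, U_r, W_z, U_z, W, U = params
    r = sigmoid(W_r @ x + U_r @ h_prev)            # reset gate
    z = sigmoid(W_z @ x + U_z @ h_prev)            # update gate
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))    # candidate state (phi = tanh)
    return z * h_prev + (1.0 - z) * h_tilde        # interpolate old and new state

# Toy example: 4-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(0)
params = tuple(rng.standard_normal(s) * 0.1
               for s in [(3, 4), (3, 3), (3, 4), (3, 3), (3, 4), (3, 3)])
h = np.zeros(3)
for x in rng.standard_normal((5, 4)):  # run over a length-5 input sequence
    h = gru_step(x, h, params)
print(h.shape)  # (3,)
```

With all weights zero, both gates sit at 0.5 and the candidate state is 0, so the unit simply halves the previous state, which makes the interpolation role of <math>z</math> easy to check.<br />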
<br />
= Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation (SMT) system, the goal of the system (the decoder, specifically) is to find a translation <math>\mathbf{f}</math> of a source sentence <math>\mathbf{e}</math> that maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term on the right-hand side is called the translation model and the second the language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a log-linear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
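As an illustration of how such feature scores combine, a minimal sketch of log-linear scoring (the feature values and weights below are invented for the example; the <math>\log Z(\mathbf{e})</math> term is constant across candidate translations of a fixed source sentence, so it can be dropped when ranking candidates):<br />

```python
def loglinear_score(features, weights):
    """Unnormalized log-probability of a candidate translation f given e:
    sum_n w_n * f_n(f, e).  The partition term log Z(e) does not depend
    on the candidate, so it is omitted for ranking purposes."""
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical feature values for one candidate translation, e.g.
# [log p_TM(e|f), log p_LM(f), RNN Encoder-Decoder phrase score, word penalty]
features = [-4.2, -7.1, -3.5, -12.0]
weights = [1.0, 0.9, 0.5, -0.1]
print(loglinear_score(features, weights))
```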
<br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and to use its scores as additional features in this log-linear model.<br />
<br />
= Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the system, Cho et al. used a baseline phrase-based SMT system together with a continuous space language model (CSLM).<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
# Baseline configuration<br />
# Baseline + RNN<br />
# Baseline + CSLM + RNN<br />
# Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words that are unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not strongly correlated and that one can expect better results by improving each method independently.</div>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state<math> h_t</math> . However, as shown in figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by, <br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. On the other hand, the model can be used to score a given pair of input and output sequences.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also propose a new type of hidden node that has been inspired by LSTM but is much simpler to compute and implement. Fig. 2 shows the graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically it can be shown as(<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector and <math>\odot</math> means elementwise multiply):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember longterm information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes<br />
<br />
<math>p(\mathbf{f}|\mathbf{e})\propto p(\mathbf{e}|\mathbf{f})p(\mathbf{f})</math><br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the loglinear model:<br />
<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model Cho et al. used baseline phrase-based SMT system and a Neural Language Model(CSLM)<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
1. Baseline configuration<br />
2. Baseline + RNN<br />
3. Baseline + CSLM + RNN<br />
4. Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, where we penalizes the number of unknown words to neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when we used both CSLM and the phrase scores from the RNN Encoder–Decoder. This suggests that the contributions of the CSLM and the RNN Encoder– Decoder are not too correlated and that one can expect better results by improving each method independently</div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=26361learning Phrase Representations2015-11-17T03:06:00Z<p>Rqiao: /* Experiments */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state<math> h_t</math> . However, as shown in figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by, <br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. On the other hand, the model can be used to score a given pair of input and output sequences.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also propose a new type of hidden node that has been inspired by LSTM but is much simpler to compute and implement. Fig. 2 shows the graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically it can be shown as(<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector and <math>\odot</math> means elementwise multiply):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember longterm information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model Cho et al. used baseline phrase-based SMT system and a Neural Language Model(CSLM)<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
1. Baseline configuration<br />
2. Baseline + RNN<br />
3. Baseline + CSLM + RNN<br />
4. Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, where we penalizes the number of unknown words to neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when we used both CSLM and the phrase scores from the RNN Encoder–Decoder. This suggests that the contributions of the CSLM and the RNN Encoder– Decoder are not too correlated and that one can expect better results by improving each method independently</div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Encdec3.png&diff=26360File:Encdec3.png2015-11-17T02:58:40Z<p>Rqiao: </p>
<hr />
<div></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=26359learning Phrase Representations2015-11-17T02:58:15Z<p>Rqiao: </p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state<math> h_t</math> . However, as shown in figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by, <br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. On the other hand, the model can be used to score a given pair of input and output sequences.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also propose a new type of hidden node that has been inspired by LSTM but is much simpler to compute and implement. Fig. 2 shows the graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically it can be shown as(<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector and <math>\odot</math> means elementwise multiply):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember longterm information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. The result is compared with baseline phrase-based SMT system and a Neural Language Model(CSLM)<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=26356learning Phrase Representations2015-11-17T02:43:58Z<p>Rqiao: /* Hidden Unit that Adaptively Remembers and Forgets */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state<math> h_t</math> . However, as shown in figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by, <br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. On the other hand, the model can be used to score a given pair of input and output sequences.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also propose a new type of hidden node that has been inspired by LSTM but is much simpler to compute and implement. Fig. 2 shows the graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically it can be shown as(<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector and <math>\odot</math> means elementwise multiply):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember longterm information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=26355learning Phrase Representations2015-11-17T02:37:47Z<p>Rqiao: /* RNN Encoder–Decoder */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state<math> h_t</math> . However, as shown in figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by, <br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (output sequence, input sequence) pair from the training set. Since the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways: to generate a target sequence given an input sequence, or to score a given pair of input and output sequences.<br />
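The second use (scoring) can be sketched as follows; the tanh recurrence standing in for <math>f</math>, the linear-softmax readout standing in for <math>g</math>, and all parameter names are hypothetical simplifications of the actual model:<br />

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(h_prev, y_prev, c, P):
    # h_t = f(h_{t-1}, y_{t-1}, c): a plain tanh recurrence as a stand-in for f
    h = np.tanh(P["Wh"] @ h_prev + P["Wy"] @ y_prev + P["Wc"] @ c + P["b"])
    # g(h_t, y_{t-1}, c) collapsed to a linear readout of h_t for brevity
    return h, softmax(P["Wo"] @ h)

def score(target_ids, E, c, P):
    """Log-probability of a given output sequence under the decoder: log p(y | c)."""
    h = np.zeros(P["Wh"].shape[0])
    y_prev, total = np.zeros(E.shape[1]), 0.0
    for t in target_ids:
        h, probs = decoder_step(h, y_prev, c, P)
        total += np.log(probs[t])
        y_prev = E[t]                   # feed the embedding of y_{t-1} back in
    return total

rng = np.random.default_rng(1)
P = {"Wh": rng.standard_normal((4, 4)), "Wy": rng.standard_normal((4, 3)),
     "Wc": rng.standard_normal((4, 4)), "b": np.zeros(4),
     "Wo": rng.standard_normal((6, 4))}
E = rng.standard_normal((6, 3))         # toy embeddings for a 6-word vocabulary
c = rng.standard_normal(4)              # pretend summary from the encoder
s = score([2, 5, 1], E, c, P)           # higher (less negative) = better pair
```

Generation works the same way, except that at each step a symbol is sampled (or the argmax taken) from the predicted distribution instead of being read from a given target sequence.<br />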
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also proposes a new type of hidden unit that is inspired by the LSTM unit but is much simpler to compute and implement. Fig. 2 shows a graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, the unit is defined as follows, where <math>\sigma</math> is the logistic sigmoid function, <math>[\cdot]_j</math> denotes the <math>j</math>-th element of a vector, and <math>\odot</math> denotes elementwise multiplication:<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=neural_Machine_Translation:_Jointly_Learning_to_Align_and_Translate&diff=26198neural Machine Translation: Jointly Learning to Align and Translate2015-11-13T02:42:46Z<p>Rqiao: /* Aligment */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Bahdanau et al. (2015) present a new way of using neural networks to perform machine translation. Rather than using the typical RNN encoder-decoder model with fixed-length intermediate vectors, they propose a method that jointly learns alignment and translation and does not restrict the intermediate encoded vectors to any specific fixed length. The result is a translation method comparable in performance to phrase-based systems (the state-of-the-art models that do not use a neural network approach); in addition, the proposed method is found to be more effective than other neural network models when applied to long sentences.<br />
<br />
In this paper, the activation function of the RNN is a gated hidden unit, which is similar to a long short-term memory (LSTM) unit and is able to better maintain contextual information from early until late in a sentence. Additionally, in the introduced method, the encoder assigns a context-dependent vector, or annotation, to every source word. The decoder then selectively combines the most relevant annotations to generate each target word; this implements a mechanism of attention in the decoder.<br />
<br />
= Previous methods =<br />
<br />
In order to better appreciate the value of this paper's contribution, it is important to understand how earlier techniques approached the problem of machine translation using neural networks.<br />
<br />
In machine translation, the problem at hand is to identify the target sentence <math>y</math> (in natural language <math>B</math>) that is the most likely corresponding translation to the source sentence <math>x</math> (in natural language <math>A</math>). The authors compactly summarize this problem using the formula <math> \arg\max_{y} P(y|x)</math>.<br />
<br />
Recent Neural Network approaches proposed by researchers such as Kalchbrenner and Blunsom<ref><br />
Kalchbrenner N, Blunsom P. Recurrent Continuous Translation Models[C]//EMNLP. 2013: 1700-1709.<br />
</ref>, Cho et al.<ref><br />
Cho K, van Merriënboer B, Bahdanau D, et al. On the properties of neural machine translation: Encoder-decoder approaches[J]. arXiv preprint arXiv:1409.1259, 2014.<br />
</ref>, Sutskever et al.<ref><br />
Sutskever I, Vinyals O, Le Q V V. Sequence to sequence learning with neural networks[C]//Advances in neural information processing systems. 2014: 3104-3112.<br />
</ref> have built neural machine translation systems that directly learn the conditional probability distribution between input <math>x</math> and output <math>y</math>. Experiments to date show that neural machine translation, or extensions of existing translation systems using RNNs, performs better compared to state-of-the-art systems.<br />
<br />
== Encoding ==<br />
<br />
Typically, the encoding step iterates through the input vectors in the representation of source sentence <math>x</math> and updates a hidden state with each new token in the input: <math>h_t = f(x_t, h_{t-1})</math>, for some nonlinear function <math>f</math>. After the entire input is read, the resulting fixed-length representation of the entire input sentence <math>x</math> is given by a nonlinear function <math>q</math> of all of the hidden states: <math>c = q(\{h_1, \ldots, h_{T_x}\})</math>. Different methods would use different nonlinear functions and different neural networks, but the essence of the approach is common to all.<br />
<br />
== Decoding == <br />
<br />
Decoding the fixed-length representation <math>c</math> of <math>x</math> is done by predicting one token of the target sentence <math>y</math> at a time, using the knowledge of all previously predicted words so far. The decoder defines a probability distribution over the possible sentences using a product of conditional probabilities <math>P(y) = \prod_t P(y_t|\{y_1, \ldots, y_{t-1}\},c)</math>. <br />
<br />
In the neural network approach, the conditional probability of the next output term given the previous ones <math>P(y_t | \{y_1, \ldots, y_{t-1}\},c)</math> is given by the evaluation of a nonlinear function <math>g(y_{t-1}, s_t, c)</math>, where <math>s_t</math> is the hidden state of the RNN.<br />
<br />
= The proposed method =<br />
<br />
The method proposed here is different from the traditional approach because it bypasses the fixed-length context vector <math>c</math> altogether, and instead aligns the tokens of the translated sentence <math>y</math> directly with the corresponding tokens of source sentence <math>x</math> as it decides which parts might be most relevant. To accommodate this, a different neural network structure needs to be set up.<br />
<br />
== Encoding ==<br />
<br />
The proposed model does not use an ordinary recurrent neural network to encode the source sentence <math>x</math>, but instead uses a bidirectional recurrent neural network (BiRNN): this model consists of both a forward and a backward RNN, where the forward RNN takes the input tokens of <math>x</math> in the original order when computing hidden states, and the backward RNN takes the tokens in reverse. Thus each token of <math>x</math> is associated with two hidden states, corresponding to the states it produces in the two RNNs. The annotation vector <math>h_j</math> of the token <math>x_j</math> in <math>x</math> is given by the concatenation of these two hidden state vectors.<br />
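A sketch of the annotation computation, assuming (for illustration only) simple bias-free tanh recurrences; the annotation of each token is the concatenation of its forward and backward states, giving a <math>2n</math>-dimensional vector:<br />

```python
import numpy as np

def birnn_annotations(X, fwd, bwd):
    """Annotation h_j = [forward state at position j ; backward state at position j]."""
    def run(seq, W_xh, W_hh):
        h, states = np.zeros(W_hh.shape[0]), []
        for x in seq:
            h = np.tanh(W_xh @ x + W_hh @ h)
            states.append(h)
        return states
    H_fwd = run(X, *fwd)                       # reads x_1, ..., x_T
    H_bwd = run(X[::-1], *bwd)[::-1]           # reads x_T, ..., x_1, then re-align
    return np.stack([np.concatenate(pair) for pair in zip(H_fwd, H_bwd)])

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))                # 5 source tokens, 3-dim embeddings
fwd = (rng.standard_normal((4, 3)), rng.standard_normal((4, 4)))
bwd = (rng.standard_normal((4, 3)), rng.standard_normal((4, 4)))
H = birnn_annotations(X, fwd, bwd)             # shape (5, 8): one 2n-dim h_j per token
```

Because the backward states are re-aligned, each annotation <math>h_j</math> summarizes the sentence both before and after position <math>j</math>.<br />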
<br />
== Alignment ==<br />
<br />
An alignment model (in the form of a neural network) is used to measure how well each annotation <math>h_j</math> of the input sentence corresponds to the current state of constructing the translated sentence (represented by the vector <math>s_{i-1}</math>, the hidden state of the RNN that produces the tokens of the output sentence <math>y</math>). This is stored as the energy score <math>e_{ij} = a(s_{i-1}, h_j)</math>. <br />
<br />
The energy scores from the alignment process are used to assign weights <math>\alpha_{ij}</math> to the annotations, effectively trying to determine which of the words in the input is most likely to correspond to the next word that needs to be translated in the current stage of the output sequence:<br />
<br />
<math>\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}</math><br />
<br />
The weights are then applied to the annotations to obtain the current context vector input: <br />
<br />
<math>c_i = \sum_j \alpha_{ij}h_j</math><br />
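These two steps can be sketched as follows. The alignment model <math>a</math> is parameterised here as a one-hidden-layer feed-forward network, <math>e_{ij} = v^\top \tanh(W_a s_{i-1} + U_a h_j)</math>, which matches the form described in the paper's appendix; all variable names and dimensions below are illustrative:<br />

```python
import numpy as np

def attention_context(s_prev, H, v, Wa, Ua):
    """Weights alpha over the annotations H, and context c_i = sum_j alpha_ij h_j."""
    # Energy e_ij = a(s_{i-1}, h_j), a small feed-forward alignment network.
    e = np.array([v @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in H])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                       # softmax over source positions j
    return alpha, alpha @ H                    # c_i has the annotation dimension

rng = np.random.default_rng(3)
H = rng.standard_normal((5, 8))                # annotations h_1..h_5 (2n = 8)
s_prev = rng.standard_normal(4)                # previous decoder state s_{i-1}
alpha, c_i = attention_context(s_prev, H, rng.standard_normal(6),
                               rng.standard_normal((6, 4)),
                               rng.standard_normal((6, 8)))
```

The weights <math>\alpha_{ij}</math> are non-negative and sum to one over the source positions, so <math>c_i</math> is a convex combination of the annotations.<br />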
<br />
Note that this is where we see one major difference between the proposed method and the previous ones: the context vector, or the representation of the input sentence, is not one static fixed-length vector <math>c</math>; rather, every time a new word of the sentence is translated, a new representation vector <math>c_i</math> is produced (though the dimension of <math>c_i</math> is still fixed at <math>2n</math>). This vector depends on the words of the source sentence most relevant to the current state of the translation (hence it is automatically aligning) and allows the input sentence to have a variable-length representation (since each step of the translation produces a new context vector <math>c_i</math> from all of the annotations).<br />
<br />
== Decoding ==<br />
<br />
The decoding is done by using an RNN to model a probability distribution on the conditional probabilities <br />
<br />
<math>P(y_i | y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)</math><br />
<br />
where here, <math>s_i</math> is the RNN hidden state at time <math>i</math>, computed by <br />
<br />
<math>s_i = f (s_{i-1}, y_{i-1}, c_i)</math><br />
<br />
and <math>c_i</math> is the current context vector representation as discussed above under Alignment.<br />
<br />
Once the encoding and alignment are done, the decoding step is fairly straightforward and corresponds with the typical approach of neural network translation systems, although the context vector representation is now different at each step of the translation.<br />
<br />
== Experiment Settings == <br />
The ACL WMT '14 English-to-French dataset was used to assess the performance of Bahdanau et al.'s (2015) <ref>Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).</ref> RNNsearch and the RNN Encoder-Decoder proposed by Cho et al. (2014) <ref>Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a).<br />
Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014).</ref>. <br />
<br />
The WMT '14 dataset actually contains the following corpora, totaling (850M words):<br />
* Europarl (61M words)<br />
* News Commentary (5.5M words)<br />
* UN (421M words) <br />
* Crawled corpora (90M and 272.5M words)<br />
<br />
This was reduced to 348M using data selection method described by Axelord, et al (2011)<ref>Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.</ref>.<br />
<br />
Both models were trained in the same manner, using minibatch stochastic gradient descent (SGD) with minibatch size 80 and AdaDelta. Once a model has finished training, beam search is used to decode the computed probability distribution and obtain a translation output.<br />
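Beam search itself is not detailed in this summary, so the following is a generic sketch rather than the authors' implementation; the <code>step_fn</code> interface (returning a new decoder state and per-token log-probabilities) is a hypothetical simplification:<br />

```python
import math

def beam_search(step_fn, start_state, beam_width=5, max_len=20, eos=0):
    """Keep the `beam_width` best partial translations by total log-probability.

    `step_fn(state, last_token)` must return (new_state, {token: log_prob, ...}).
    """
    beams = [(0.0, [], start_state)]            # (log-prob, tokens so far, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, toks, state in beams:
            new_state, logprobs = step_fn(state, toks[-1] if toks else None)
            for tok, lp in logprobs.items():
                cand = (logp + lp, toks + [tok], new_state)
                # A hypothesis that emits the end-of-sequence token is complete.
                (finished if tok == eos else candidates).append(cand)
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
        if not beams:
            break
    finished.extend(beams)                      # include still-open hypotheses
    return max(finished, key=lambda b: b[0])[1]

# Toy usage: a degenerate model that always assigns p=0.6 to token 1, p=0.4 to EOS.
best = beam_search(lambda s, t: (s, {1: math.log(0.6), 0: math.log(0.4)}),
                   None, beam_width=2, max_len=5)
```

Greedy decoding is the special case <code>beam_width=1</code>; a wider beam trades computation for a better approximation of the most probable translation.<br />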
<br />
= Results =<br />
<br />
The authors performed experiments comparing the proposed model, called "RNNsearch", with the previous kind of model, referred to as "RNNencdec". Both models were trained on the same datasets for translating English to French, with one dataset containing sentences of length up to 30 words and the other containing sentences with at most 50. They used a shortlist of the 30,000 most frequent words in each language to train their models, and any word not included in the shortlist is mapped to a special token representing an unknown word.<br />
<br />
Quantitatively, the RNNsearch scores exceed those of RNNencdec by a clear margin. The distinction is particularly strong for longer sentences, which the authors note to be a problem area for RNNencdec: information gets lost when trying to "squash" long sentences into fixed-length vector representations.<br />
<br />
The following graph, provided in the paper, shows the performance of RNNsearch compared with RNNencdec, based on the BLEU scores for evaluating machine translation.<br />
<br />
[[File:RNNsearch_Graph.jpg]]<br />
<br />
Qualitatively, the RNNsearch method does a good job of aligning words in the translation process, even when they need to be rearranged in the translated sentence. Long sentences are also handled very well: while RNNencdec is shown to typically lose meaning and effectiveness after a certain number of words into the sentence, RNNsearch seems robust and reliable even for unusually long sentences.<br />
<br />
= Conclusion and Comments =<br />
<br />
Overall, the algorithm proposed by the authors gives a new and seemingly useful approach towards machine translation, particularly for translating long sentences.<br />
<br />
The performance appears to be good, but it would be interesting to see if it can be maintained when translating between languages that are not as closely aligned naturally as English and French usually are. The authors briefly refer to other languages (such as German) but do not provide any experiments or detailed comments to describe how the algorithm would perform in such cases. <br />
<br />
It is also interesting to note that, while the performance was always shown to be better for RNNsearch than for the older RNNencdec model, the former also includes more hidden units overall in its models than the latter. RNNencdec was mentioned as having 1000 hidden units for each of its encoding and decoding RNNs, giving a total of 2000; meanwhile, RNNsearch had 1000 hidden units for each the forward and backward RNNs in encoding, as well as 1000 more for the decoding RNN, giving a total of 3000. This is perhaps a worthy point to take into consideration when judging the relative performance of the two models objectively.<br />
<br />
Compare to some other algorithms, the performance of proposed algorithm for rare words, even in English to French translation is not good enough. For long sentences with large number of rare words the algorithm which uses a deep LSTM to encode the input sequence and a separate deep LSTM to output the translation works more accurate with larger BLEU score. <ref> Sutskever I, Le Q, Vinyals O, Zaremba W (1997).Addressing the Rare Word Problem in<br />
Neural Machine Translation, </ref>,. <br />
<br />
Another approach to explaining the performance gains of RNNsearch over RNNencdec is due to RNNsearch's usage of the Bi-Directional RNN (BiRNN) as both encoder and decoder. As explained by Schuster and Paliwal (1997) <ref>Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45 (11), 2673–2681</ref>, compared to traditional RNN which only explores past data, BiRNN considers both past and future contexts.<br />
<br />
One of the main drawbacks of the method is that, since the complexity of training increases as the number of target words increases, the number of target worlds must be limited (30000-80000). Since most languages are much larger than this, there may be at least a few words in sentences that are not covered by the shortlist (especially for languages with a rich set of words).<br />
<br />
=Reference= <br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=neural_Machine_Translation:_Jointly_Learning_to_Align_and_Translate&diff=26195neural Machine Translation: Jointly Learning to Align and Translate2015-11-13T02:38:59Z<p>Rqiao: /* Decoding */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Bahdanau et al. (2015) present a new way of using neural networks to perform machine translation. Rather than using the typical RNN encoder-decoder model with fixed-length intermediate vectors, they propose a method that jointly learns alignment and translation and does not restrict the intermediate encoded representation to any specific fixed length. The result is a translation method comparable in performance to phrase-based systems (the state-of-the-art models that do not use a neural network approach); moreover, the proposed method proves more effective than other neural network models when applied to long sentences.<br />
<br />
In this paper, the gated hidden unit is used as the activation function of the RNN; it is similar to a long short-term memory (LSTM) unit, but is better able to maintain contextual information from early until late in a sentence. Additionally, in the introduced method, the encoder assigns a context-dependent vector, or annotation, to every source word. The decoder then selectively combines the most relevant annotations to generate each target word; this implements a mechanism of attention in the decoder.<br />
<br />
= Previous methods =<br />
<br />
In order to better appreciate the value of this paper's contribution, it is important to understand how earlier techniques approached the problem of machine translation using neural networks.<br />
<br />
In machine translation, the problem at hand is to identify the target sentence <math>y</math> (in natural language <math>B</math>) that is the most likely corresponding translation to the source sentence <math>x</math> (in natural language <math>A</math>). The authors compactly summarize this problem using the formula <math> \arg\max_{y} P(y|x)</math>.<br />
<br />
Recent neural network approaches proposed by researchers such as Kalchbrenner and Blunsom<ref>Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In EMNLP, pages 1700-1709.</ref>, Cho et al.<ref>Cho, K., van Merriënboer, B., Bahdanau, D., et al. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.</ref>, and Sutskever et al.<ref>Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.</ref> have built neural machine translation systems that directly learn the conditional probability distribution between input <math>x</math> and output <math>y</math>. Current experiments show that neural machine translation, and extensions of existing translation systems using RNNs, perform better than state-of-the-art systems.<br />
<br />
<br />
== Encoding ==<br />
<br />
Typically, the encoding step iterates through the input vectors in the representation of source sentence <math>x</math> and updates a hidden state with each new token in the input: <math>h_t = f(x_t, h_{t-1})</math>, for some nonlinear function <math>f</math>. After the entire input is read, the resulting fixed-length representation of the entire input sentence <math>x</math> is given by a nonlinear function <math>q</math> of all of the hidden states: <math>c = q(\{h_1, \ldots, h_{T_x}\})</math>. Different methods would use different nonlinear functions and different neural networks, but the essence of the approach is common to all.<br />
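As an illustration only (not any author's actual implementation), this fixed-length encoding scheme can be sketched in plain NumPy, taking <math>f</math> to be a tanh recurrence and <math>q</math> to be the last hidden state; the dimensions and random weights are arbitrary assumptions.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8                       # illustrative embedding / hidden sizes
W = rng.normal(0, 0.1, (d_h, d_in))    # input-to-hidden weights (assumed)
U = rng.normal(0, 0.1, (d_h, d_h))     # hidden-to-hidden weights (assumed)

def encode(tokens):
    """h_t = f(x_t, h_{t-1}) with f = tanh; returns every hidden state."""
    h, states = np.zeros(d_h), []
    for x_t in tokens:
        h = np.tanh(W @ x_t + U @ h)
        states.append(h)
    return states

# q({h_1, ..., h_T}) = h_T: the whole sentence squashed into one vector
sentence = [rng.normal(size=d_in) for _ in range(6)]
c = encode(sentence)[-1]
print(c.shape)                         # (8,) no matter how long the sentence is
```

Whatever the sentence length, <code>c</code> has the same fixed dimension, which is exactly the bottleneck the paper targets.<br />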
<br />
== Decoding == <br />
<br />
Decoding the fixed-length representation <math>c</math> of <math>x</math> is done by predicting one token of the target sentence <math>y</math> at a time, using the knowledge of all previously predicted words so far. The decoder defines a probability distribution over the possible sentences using a product of conditional probabilities <math>P(y) = \prod_t P(y_t|\{y_1, \ldots, y_{t-1}\},c)</math>. <br />
<br />
In the neural network approach, the conditional probability of the next output term given the previous ones <math>P(y_t | \{y_1, \ldots, y_{t-1}\},c)</math> is given by the evaluation of a nonlinear function <math>g(y_{t-1}, s_t, c)</math>, where <math>s_t</math> is the hidden state of the RNN.<br />
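A matching decoder step can be sketched the same way; here <math>g</math> is taken to be a softmax over a toy vocabulary, and every size and weight matrix below is an assumption made purely for illustration.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_c, vocab = 8, 8, 10             # illustrative sizes (assumed)
Ws = rng.normal(0, 0.1, (d_h, d_h))    # state-to-state weights
Wy = rng.normal(0, 0.1, (d_h, vocab))  # previous-token weights
Wc = rng.normal(0, 0.1, (d_h, d_c))    # context-vector weights
Wo = rng.normal(0, 0.1, (vocab, d_h))  # output projection

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(y_prev, s_prev, c):
    """Update the hidden state, then g(y_{t-1}, s_t, c) -> P(y_t | history, c)."""
    s_t = np.tanh(Ws @ s_prev + Wy @ y_prev + Wc @ c)
    p = softmax(Wo @ s_t)              # distribution over the target vocabulary
    return p, s_t

y_prev = np.eye(vocab)[0]              # one-hot previous token (illustrative)
p, s = decode_step(y_prev, np.zeros(d_h), rng.normal(size=d_c))
print(round(p.sum(), 6))               # 1.0 -- a valid probability distribution
```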
<br />
= The proposed method =<br />
<br />
The method proposed here is different from the traditional approach because it bypasses the fixed-length context vector <math>c</math> altogether, and instead aligns the tokens of the translated sentence <math>y</math> directly with the corresponding tokens of source sentence <math>x</math> as it decides which parts might be most relevant. To accommodate this, a different neural network structure needs to be set up.<br />
<br />
== Encoding ==<br />
<br />
The proposed model does not use an ordinary recurrent neural network to encode the source sentence <math>x</math>, but instead uses a bidirectional recurrent neural network (BiRNN): a model that consists of both a forward and a backward RNN, where the forward RNN reads the input tokens of <math>x</math> in their original order when computing hidden states, and the backward RNN reads the tokens in reverse. Thus each token of <math>x</math> is associated with two hidden states, corresponding to the states it produces in the two RNNs. The annotation vector <math>h_j</math> of the token <math>x_j</math> in <math>x</math> is given by the concatenation of these two hidden state vectors.<br />
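A minimal sketch of the BiRNN annotations follows (illustrative sizes and random weights; the paper uses gated units rather than the plain tanh assumed here).<br />

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_h = 4, 8                       # illustrative sizes (assumed)
Wf, Uf = rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))
Wb, Ub = rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))

def run_rnn(tokens, W, U):
    """Simple tanh recurrence over a token sequence; returns all states."""
    h, out = np.zeros(d_h), []
    for x_t in tokens:
        h = np.tanh(W @ x_t + U @ h)
        out.append(h)
    return out

def birnn_annotations(tokens):
    """Annotation h_j = [forward state; backward state] for each source token."""
    fwd = run_rnn(tokens, Wf, Uf)                 # reads x_1 .. x_T
    bwd = run_rnn(tokens[::-1], Wb, Ub)[::-1]     # reads x_T .. x_1, re-aligned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

x = [rng.normal(size=d_in) for _ in range(5)]
H = birnn_annotations(x)
print(len(H), H[0].shape)              # 5 (16,): one 2*d_h annotation per token
```

Each annotation thus summarizes the sentence both before and after its token, which is what the alignment model scores next.<br />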
<br />
== Alignment ==<br />
<br />
An alignment model (in the form of a neural network) is used to measure how well each annotation <math>h_j</math> of the input sentence corresponds to the current state of constructing the translated sentence (represented by the vector <math>s_{i-1}</math>, the hidden state of the RNN that generates the tokens of the output sentence <math>y</math>). This is stored as the energy score <math>e_{ij} = a(s_{i-1}, h_j)</math>. <br />
<br />
The energy scores from the alignment process are used to assign weights <math>\alpha_{ij}</math> to the annotations, effectively trying to determine which of the words in the input is most likely to correspond to the next word that needs to be translated in the current stage of the output sequence:<br />
<br />
<math>\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}</math><br />
<br />
The weights are then applied to the annotations to obtain the current context vector input: <br />
<br />
<math>c_i = \sum_j \alpha_{ij}h_j</math><br />
<br />
Note that this is where we see one major difference between the proposed method and the previous ones: the representation of the input sentence is not one fixed-length static vector <math>c</math>; rather, every time a new word of the output is translated, a new context vector <math>c_i</math> is produced. This vector depends on the words of the source sentence most relevant to the current state of the translation (hence it aligns automatically) and gives the input sentence a variable-length representation, since a separate context vector <math>c_i</math> is computed for each position of the output sentence.<br />
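The alignment model <math>a</math> in the paper is a small feedforward network trained jointly with the rest of the system; the additive form sketched below, <math>e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)</math>, follows the paper, while the dimensions and random weights are illustrative assumptions.<br />

```python
import numpy as np

rng = np.random.default_rng(3)
d_s, d_a = 8, 16            # decoder state size, annotation size (2 * encoder)
Wa = rng.normal(0, 0.1, (d_s, d_s))
Ua = rng.normal(0, 0.1, (d_s, d_a))
va = rng.normal(0, 0.1, d_s)

def attention_context(s_prev, annotations):
    """e_ij = a(s_{i-1}, h_j); alpha = softmax(e); c_i = sum_j alpha_ij h_j."""
    e = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h) for h in annotations])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                          # normalized attention weights
    c_i = sum(a * h for a, h in zip(alpha, annotations))
    return alpha, c_i

H = [rng.normal(size=d_a) for _ in range(5)]      # 5 source annotations
alpha, c_i = attention_context(np.zeros(d_s), H)
print(round(alpha.sum(), 6), c_i.shape)           # weights sum to 1; c_i is d_a-dim
```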
<br />
== Decoding ==<br />
<br />
The decoding is done by using an RNN to model a probability distribution on the conditional probabilities <br />
<br />
<math>P(y_i | y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)</math><br />
<br />
where <math>s_i</math> is the hidden state of the decoder RNN at time step <math>i</math>, computed by <br />
<br />
<math>s_i = f (s_{i-1}, y_{i-1}, c_i)</math><br />
<br />
and <math>c_i</math> is the current context vector representation as discussed above under Alignment.<br />
<br />
Once the encoding and alignment are done, the decoding step is fairly straightforward and corresponds with the typical approach of neural network translation systems, although the context vector representation is now different at each step of the translation.<br />
<br />
== Experiment Settings == <br />
The ACL WMT '14 English-to-French dataset was used to assess the performance of Bahdanau et al. (2015)'s<ref>Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.</ref> RNNsearch against the RNN Encoder-Decoder proposed by Cho et al. (2014)<ref>Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2014).</ref>. <br />
<br />
The WMT '14 dataset comprises the following corpora, totaling 850M words:<br />
* Europarl (61M words)<br />
* News Commentary (5.5M words)<br />
* UN (421M words) <br />
* Crawled corpora (90M and 272.5M words)<br />
<br />
This was reduced to 348M words using the data selection method described by Axelrod et al. (2011)<ref>Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.</ref>.<br />
<br />
Both models were trained in the same manner: minibatch stochastic gradient descent (SGD) with a minibatch size of 80, using Adadelta to adapt the learning rate. Once a model has finished training, beam search is used to decode the computed probability distribution and obtain a translation output.<br />
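The summary above does not spell out the beam search, but the generic procedure is easy to sketch: at each step, keep only the <code>beam_size</code> best-scoring partial translations. The toy scoring function and token ids below are purely illustrative assumptions, not the paper's model.<br />

```python
import numpy as np

def beam_search(step_fn, bos, eos, beam_size=3, max_len=10):
    """Keep the `beam_size` highest log-probability partial translations."""
    beams = [([bos], 0.0)]                     # (token sequence, log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                 # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            probs = step_fn(seq)               # P(next token | prefix)
            for tok, p in enumerate(probs):
                candidates.append((seq + [tok], score + np.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]

# Toy "model": strongly prefers token (last + 1), capped at eos = 3
def toy_step(seq):
    probs = np.full(4, 0.1)
    probs[min(seq[-1] + 1, 3)] = 0.7
    return probs / probs.sum()

print(beam_search(toy_step, bos=0, eos=3))     # [0, 1, 2, 3]
```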
<br />
= Results =<br />
<br />
The authors performed some experiments using the proposed model of machine translation, calling it "RNNsearch", in comparison with the previous kind of model, referred to as "RNNencdec". Both models were trained on the same datasets for translating English to French, with one dataset containing sentences of length up to 30 words and the other containing sentences of at most 50 words. They used a shortlist of the 30,000 most frequent words in each language to train their models; any word not in the shortlist is mapped to a special token representing an unknown word.<br />
<br />
Quantitatively, the RNNsearch scores exceed those of RNNencdec by a clear margin. The difference is particularly strong on longer sentences, which the authors note to be a problem area for RNNencdec: information gets lost when trying to "squash" long sentences into fixed-length vector representations.<br />
<br />
The following graph, provided in the paper, shows the performance of RNNsearch compared with RNNencdec, based on the BLEU scores for evaluating machine translation.<br />
<br />
[[File:RNNsearch_Graph.jpg]]<br />
<br />
Qualitatively, the RNNsearch method does a good job of aligning words in the translation process, even when they need to be rearranged in the translated sentence. Long sentences are also handled very well: while RNNencdec is shown to typically lose meaning and effectiveness after a certain number of words into the sentence, RNNsearch seems robust and reliable even for unusually long sentences.<br />
<br />
= Conclusion and Comments =<br />
<br />
Overall, the algorithm proposed by the authors gives a new and seemingly useful approach towards machine translation, particularly for translating long sentences.<br />
<br />
The performance appears to be good, but it would be interesting to see if it can be maintained when translating between languages that are not as closely aligned naturally as English and French usually are. The authors briefly refer to other languages (such as German) but do not provide any experiments or detailed comments to describe how the algorithm would perform in such cases. <br />
<br />
It is also interesting to note that, while RNNsearch always outperformed the older RNNencdec model, it also has more hidden units overall. RNNencdec was described as having 1000 hidden units for each of its encoding and decoding RNNs, for a total of 2000; meanwhile, RNNsearch had 1000 hidden units for each of the forward and backward RNNs in the encoder, plus 1000 more for the decoder RNN, for a total of 3000. This is worth taking into consideration when judging the relative performance of the two models objectively.<br />
<br />
Compared to some other algorithms, the performance of the proposed algorithm on rare words, even in English-to-French translation, is not good enough. For long sentences with a large number of rare words, an algorithm that uses one deep LSTM to encode the input sequence and a separate deep LSTM to output the translation is more accurate, achieving a higher BLEU score.<ref>Luong, M.-T., Sutskever, I., Le, Q. V., Vinyals, O., and Zaremba, W. (2015). Addressing the rare word problem in neural machine translation. In Proceedings of ACL.</ref> <br />
<br />
Another way to explain the performance gains of RNNsearch over RNNencdec is RNNsearch's use of a bidirectional RNN (BiRNN) in its encoder. As explained by Schuster and Paliwal (1997)<ref>Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673-2681.</ref>, compared to a traditional RNN, which only exploits past context, a BiRNN considers both past and future contexts.<br />
<br />
One of the main drawbacks of the method is that, since the complexity of training increases with the number of target words, the target vocabulary must be limited (30,000-80,000 words). Since the vocabularies of most languages are much larger than this, there may be at least a few words in a sentence that are not covered by the shortlist (especially for languages with rich vocabularies).<br />
<br />
=Reference= <br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=neural_Machine_Translation:_Jointly_Learning_to_Align_and_Translate&diff=26194neural Machine Translation: Jointly Learning to Align and Translate2015-11-13T02:35:09Z<p>Rqiao: /* Decoding */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper Bahdanau et al (2015) presents a new way of using neural networks to perform machine translation. Rather than using the typical RNN encoder-decoder model with fixed-length intermediate vectors, they proposed a method that uses a joint learning process for both alignment and translation, and does not restrict intermediate encoded vectors to any specific fixed length. The result is a translation method that is comparable in performance to phrase-based systems (the state-of-the-art effective models that do not use a neural network approach), additionally it has been found the proposed method is more effective compared to other neural network models when applied to long sentences.<br />
<br />
In this paper, for the activation function of an RNN, the gated hidden unit is used which is similar to a long short-term memory (LSTM), but is able to better maintain contextual information from early until late in a sentence.Additionally, in the introduced method, the encoder assigns a context-dependent vector, or annotation, to every source word. The decoder then selectively combines the most relevant annotations to generate each target word; this implements a mechanism of attention in the decoder.<br />
<br />
= Previous methods =<br />
<br />
In order to better appreciate the value of this paper's contribution, it is important to understand how earlier techniques approached the problem of machine translation using neural networks.<br />
<br />
In machine translation, the problem at hand is to identify the target sentence <math>y</math> (in natural language <math>B</math>) that is the most likely corresponding translation to the source sentence <math>x</math> (in natural language <math>A</math>). The authors compactly summarize this problem using the formula <math> \arg\max_{y} P(y|x)</math>.<br />
<br />
Recent Neural Network approaches proposed by researchers such as Kalchbrenner and Blunsom<ref><br />
Kalchbrenner N, Blunsom P. Recurrent Continuous Translation Models[C]//EMNLP. 2013: 1700-1709.<br />
</ref>, Cho et al.<ref><br />
Cho K, van Merriënboer B, Bahdanau D, et al. On the properties of neural machine translation: Encoder-decoder approaches[J]. arXiv preprint arXiv:1409.1259, 2014.<br />
</ref>, Sutvesker et al.<ref><br />
Sutskever I, Vinyals O, Le Q V V. Sequence to sequence learning with neural networks[C]//Advances in neural information processing systems. 2014: 3104-3112.<br />
</ref> has built a neural machine translation to directly learn the conditional probability distribution between input <math>x</math> and output <math>y</math>. Experiments at current show that neural machine translation or extension of existing translation systems using RNNs perform better compared to state of the art systems.<br />
<br />
<br />
== Encoding ==<br />
<br />
Typically, the encoding step iterates through the input vectors in the representation of source sentence <math>x</math> and updates a hidden state with each new token in the input: <math>h_t = f(x_t, h_{t-1})</math>, for some nonlinear function <math>f</math>. After the entire input is read, the resulting fixed-length representation of the entire input sentence <math>x</math> is given by a nonlinear function <math>q</math> of all of the hidden states: <math>c = q(\{h_1, \ldots, h_{T_x}\})</math>. Different methods would use different nonlinear functions and different neural networks, but the essence of the approach is common to all.<br />
<br />
== Decoding == <br />
<br />
Decoding the fixed-length representation <math>c</math> of <math>x</math> is done by predicting one token of the target sentence <math>y</math> at a time, using the knowledge of all previously predicted words so far. The decoder defines a probability distribution over the possible sentences using a product of conditional probabilities <math>P(y) = \Pi_t P(y_t|\{y_1, \ldots, y_{t-1}\},c)</math>. <br />
<br />
In the neural network approach, the conditional probability of the next output term given the previous ones <math>P(y_t | \{y_1, \ldots, y_{t-1}\},c)</math> is given by the evaluation of a nonlinear function <math>g(y_{t-1}, s_t, c)</math>, where <math>s_t</math> is the hidden state of the RNN.<br />
<br />
= The proposed method =<br />
<br />
The method proposed here is different from the traditional approach because it bypasses the fixed-length context vector <math>c</math> altogether, and instead aligns the tokens of the translated sentence <math>y</math> directly with the corresponding tokens of source sentence <math>x</math> as it decides which parts might be most relevant. To accommodate this, a different neural network structure needs to be set up.<br />
<br />
== Encoding ==<br />
<br />
The proposed model does not use an ordinary recursive neural network to encode the target sentence <math>x</math>, but instead uses a bidirectional recursive neural network (BiRNN): this is a model that consists of both a forward and backward RNN, where the forward RNN takes the input tokens of <math>x</math> in the correct order when computing hidden states, and the backward RNN takes the tokens in reverse. Thus each token of <math>x</math> is associated with two hidden states, corresponding to the states it produces in the two RNNs. The annotation vector <math>h_j</math> of the token <math>x_j</math> in <math>x</math> is given by the concatenation of these two hidden states vectors.<br />
<br />
== Aligment ==<br />
<br />
An alignment model (in the form of a neural network) is used to measure how well each annotation <math>h_j</math> of the input sentence corresponds to the current state of constructing the translated sentence (represented by the vector <math>s_{i-1}</math>, the hidden state of the RNN that identifies the tokens in the output sentence <math>y</math>. This is stored as the energy score <math>e_{ij} = a(s_{i-1}, h_j)</math>. <br />
<br />
The energy scores from the alignment process are used to assign weights <math>\alpha_{ij}</math> to the annotations, effectively trying to determine which of the words in the input is most likely to correspond to the next word that needs to be translated in the current stage of the output sequence:<br />
<br />
<math>\alpha_{ij} = \frac{\exp(e_{ij})}{\Sigma_k \exp(e_{ik})}</math><br />
<br />
The weights are then applied to the annotations to obtain the current context vector input: <br />
<br />
<math>c_i = \Sigma_j \alpha_{ij}h_j</math><br />
<br />
Note that this is where we see one major difference between the proposed method and the previous ones: The context vector, or the representation of the input sentence, is not one fixed-length static vector <math>c</math>; rather, every time we translate a new word in the sentence, a new representation vector <math>c_i</math> is produced. This vector depends on the most relevant words in the source sentence to the current state in the translation (hence it is automatically aligning) and allows the input sentence to have a variable length representation (since each annotation in the input representation produces a new context vector <math>c_i</math>).<br />
<br />
== Decoding ==<br />
<br />
The decoding is done by using an RNN to model a probability distribution on the conditional probabilities <br />
<br />
<math>P(y_i | y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)</math><br />
<br />
where here, <math>s_i</math> is the RNN hidden state at the previous time step,computed by <br />
<br />
<math>s_i = f(s_{i-1}, y_{i-1}, c_i)</math><br />
<br />
and <math>c_i</math> is the current context vector representation as discussed above under Alignment.<br />
<br />
Once the encoding and alignment are done, the decoding step is fairly straightforward and corresponds with the typical approach of neural network translation systems, although the context vector representation is now different at each step of the translation.<br />
<br />
== Experiment Settings == <br />
The ACL WMT '14 dataset containing English to French translation were used to assess the performance of the Bahdanau et al(2015)'s <ref>Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).</ref> RNNSearch and RNN Encoder-Decoder proposed by Cho et al (2014) <ref>Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a).<br />
Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014).</ref>. <br />
<br />
The WMT '14 dataset actually contains the following corpora, totaling (850M words):<br />
* Europarl (61M words)<br />
* News Commentary (5.5M words)<br />
* UN (421M words) <br />
* Crawled corpora (90M and 272.5 words)<br />
<br />
This was reduced to 348M using data selection method described by Axelord, et al (2011)<ref>Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.</ref>.<br />
<br />
Both models were trained in the same manner, by using minibatch stochastic gradient descent (SGD) with size 80 and AdaDelta. Once the model has finished training, beam search is used to decode the computed probability distribution to obtain a translation output.<br />
<br />
= Results =<br />
<br />
The authors performed some experiments using the proposed model of machine translation, calling it "RNNsearch", in comparison with the previous kind of model, referred to as "RNNencdec". Both models were trained on the same datasets for translating English to French, with one dataset containing sentences of length up to 30 words, and the other containing sentences with at most 50. They used a shortlist of 30,000 most frequent words in each language to train their models and any word not included in the shortlist is mapped to a special token representing unknown word.<br />
<br />
Quantitatively, the RNNsearch scores exceed RNNencdec by a clear margin. The distinction is particularly strong in longer sentences, which the authors note to be a problem area for RNNencdec -- information gets lost when trying to "squash" long sentences into fixed-length vector representations.<br />
<br />
The following graph, provided in the paper, shows the performance of RNNsearch compared with RNNencdec, based on the BLEU scores for evaluating machine translation.<br />
<br />
[[File:RNNsearch_Graph.jpg]]<br />
<br />
Qualitatively, the RNNsearch method does a good job of aligning words in the translation process, even when they need to be rearranged in the translated sentence. Long sentences are also handled very well: while RNNencdec is shown to typically lose meaning and effectiveness after a certain number of words into the sentence, RNNsearch seems robust and reliable even for unusually long sentences.<br />
<br />
= Conclusion and Comments =<br />
<br />
Overall, the algorithm proposed by the authors gives a new and seemingly useful approach towards machine translation, particularly for translating long sentences.<br />
<br />
The performance appears to be good, but it would be interesting to see if it can be maintained when translating between languages that are not as closely aligned naturally as English and French usually are. The authors briefly refer to other languages (such as German) but do not provide any experiments or detailed comments to describe how the algorithm would perform in such cases. <br />
<br />
It is also interesting to note that, while the performance was always shown to be better for RNNsearch than for the older RNNencdec model, the former also includes more hidden units overall in its models than the latter. RNNencdec was mentioned as having 1000 hidden units for each of its encoding and decoding RNNs, giving a total of 2000; meanwhile, RNNsearch had 1000 hidden units for each the forward and backward RNNs in encoding, as well as 1000 more for the decoding RNN, giving a total of 3000. This is perhaps a worthy point to take into consideration when judging the relative performance of the two models objectively.<br />
<br />
Compare to some other algorithms, the performance of proposed algorithm for rare words, even in English to French translation is not good enough. For long sentences with large number of rare words the algorithm which uses a deep LSTM to encode the input sequence and a separate deep LSTM to output the translation works more accurate with larger BLEU score. <ref> Sutskever I, Le Q, Vinyals O, Zaremba W (1997).Addressing the Rare Word Problem in<br />
Neural Machine Translation, </ref>,. <br />
<br />
Another approach to explaining the performance gains of RNNsearch over RNNencdec is due to RNNsearch's usage of the Bi-Directional RNN (BiRNN) as both encoder and decoder. As explained by Schuster and Paliwal (1997) <ref>Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45 (11), 2673–2681</ref>, compared to traditional RNN which only explores past data, BiRNN considers both past and future contexts.<br />
<br />
One of the main drawbacks of the method is that, since the complexity of training increases as the number of target words increases, the number of target worlds must be limited (30000-80000). Since most languages are much larger than this, there may be at least a few words in sentences that are not covered by the shortlist (especially for languages with a rich set of words).<br />
<br />
=Reference= <br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=neural_Machine_Translation:_Jointly_Learning_to_Align_and_Translate&diff=26187neural Machine Translation: Jointly Learning to Align and Translate2015-11-13T02:18:05Z<p>Rqiao: /* Reference */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper Bahdanau et al (2015) presents a new way of using neural networks to perform machine translation. Rather than using the typical RNN encoder-decoder model with fixed-length intermediate vectors, they proposed a method that uses a joint learning process for both alignment and translation, and does not restrict intermediate encoded vectors to any specific fixed length. The result is a translation method that is comparable in performance to phrase-based systems (the state-of-the-art effective models that do not use a neural network approach), additionally it has been found the proposed method is more effective compared to other neural network models when applied to long sentences.<br />
<br />
In this paper, for the activation function of an RNN, the gated hidden unit is used which is similar to a long short-term memory (LSTM), but is able to better maintain contextual information from early until late in a sentence.Additionally, in the introduced method, the encoder assigns a context-dependent vector, or annotation, to every source word. The decoder then selectively combines the most relevant annotations to generate each target word; this implements a mechanism of attention in the decoder.<br />
<br />
= Previous methods =<br />
<br />
In order to better appreciate the value of this paper's contribution, it is important to understand how earlier techniques approached the problem of machine translation using neural networks.<br />
<br />
In machine translation, the problem at hand is to identify the target sentence <math>y</math> (in natural language <math>B</math>) that is the most likely corresponding translation to the source sentence <math>x</math> (in natural language <math>A</math>). The authors compactly summarize this problem using the formula <math> \arg\max_{y} P(y|x)</math>.<br />
<br />
Recent Neural Network approaches proposed by researchers such as Kalchbrenner and Blunsom<ref><br />
Kalchbrenner N, Blunsom P. Recurrent Continuous Translation Models[C]//EMNLP. 2013: 1700-1709.<br />
</ref>, Cho et al.<ref><br />
Cho K, van Merriënboer B, Bahdanau D, et al. On the properties of neural machine translation: Encoder-decoder approaches[J]. arXiv preprint arXiv:1409.1259, 2014.<br />
</ref>, and Sutskever et al.<ref><br />
Sutskever I, Vinyals O, Le Q V V. Sequence to sequence learning with neural networks[C]//Advances in neural information processing systems. 2014: 3104-3112.<br />
</ref> have built neural machine translation systems that directly learn the conditional probability distribution between input <math>x</math> and output <math>y</math>. Current experiments show that neural machine translation, or extensions of existing translation systems using RNNs, perform better than state-of-the-art systems.<br />
<br />
<br />
== Encoding ==<br />
<br />
Typically, the encoding step iterates through the input vectors in the representation of source sentence <math>x</math> and updates a hidden state with each new token in the input: <math>h_t = f(x_t, h_{t-1})</math>, for some nonlinear function <math>f</math>. After the entire input is read, the resulting fixed-length representation of the entire input sentence <math>x</math> is given by a nonlinear function <math>q</math> of all of the hidden states: <math>c = q(\{h_1, \ldots, h_{T_x}\})</math>. Different methods would use different nonlinear functions and different neural networks, but the essence of the approach is common to all.<br />
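The fixed-length encoding step above can be sketched as follows. This is a minimal illustration with made-up sizes and random weights: <math>f</math> is a plain tanh RNN cell, and <math>q</math> simply returns the last hidden state, one common choice in early encoder-decoder models.<br />

```python
import numpy as np

# Minimal sketch of the fixed-length encoder: h_t = f(x_t, h_{t-1}),
# then c = q({h_1, ..., h_T}). Sizes and weights are illustrative only;
# f is a plain tanh RNN cell and q returns the final hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W_x = 0.1 * rng.standard_normal((d_h, d_in))
W_h = 0.1 * rng.standard_normal((d_h, d_h))

def encode(tokens):
    """Fold a sequence of input vectors into one fixed-length vector c."""
    h = np.zeros(d_h)
    for x_t in tokens:
        h = np.tanh(W_x @ x_t + W_h @ h)  # h_t = f(x_t, h_{t-1})
    return h                              # c = q(...) = h_T

c = encode([rng.standard_normal(d_in) for _ in range(6)])
```

Note that however long the input sentence, the output <math>c</math> always has the same dimension; this is exactly the bottleneck the proposed method later removes.<br />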
<br />
== Decoding == <br />
<br />
Decoding the fixed-length representation <math>c</math> of <math>x</math> is done by predicting one token of the target sentence <math>y</math> at a time, using the knowledge of all previously predicted words so far. The decoder defines a probability distribution over the possible sentences using a product of conditional probabilities <math>P(y) = \Pi_t P(y_t|\{y_1, \ldots, y_{t-1}\},c)</math>. <br />
<br />
In the neural network approach, the conditional probability of the next output term given the previous ones <math>P(y_t | \{y_1, \ldots, y_{t-1}\},c)</math> is given by the evaluation of a nonlinear function <math>g(y_{t-1}, s_t, c)</math>, where <math>s_t</math> is the hidden state of the RNN.<br />
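As a tiny numeric example of this factorization (the conditional probabilities below are made up), the probability of a full output sentence is the product of the stepwise conditionals, typically accumulated in log space:<br />

```python
import math

# P(y) = prod_t P(y_t | y_1..y_{t-1}, c), for a made-up three-word output.
# Summing log probabilities avoids numerical underflow on long sentences.
cond = [0.5, 0.8, 0.9]   # P(y_1|c), P(y_2|y_1,c), P(y_3|y_1,y_2,c)
log_p = sum(math.log(p) for p in cond)
p_y = math.exp(log_p)    # 0.5 * 0.8 * 0.9 = 0.36
```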
<br />
= The proposed method =<br />
<br />
The method proposed here is different from the traditional approach because it bypasses the fixed-length context vector <math>c</math> altogether, and instead aligns the tokens of the translated sentence <math>y</math> directly with the corresponding tokens of source sentence <math>x</math> as it decides which parts might be most relevant. To accommodate this, a different neural network structure needs to be set up.<br />
<br />
== Encoding ==<br />
<br />
The proposed model does not use an ordinary recurrent neural network to encode the source sentence <math>x</math>, but instead uses a bidirectional recurrent neural network (BiRNN): a model consisting of both a forward and a backward RNN, where the forward RNN reads the input tokens of <math>x</math> in their original order when computing hidden states, and the backward RNN reads the tokens in reverse. Thus each token of <math>x</math> is associated with two hidden states, one from each of the two RNNs. The annotation vector <math>h_j</math> of the token <math>x_j</math> in <math>x</math> is the concatenation of these two hidden state vectors.<br />
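These BiRNN annotations can be sketched as follows (random weights and arbitrary sizes; the cell here is a plain tanh RNN rather than the paper's gated unit):<br />

```python
import numpy as np

# Run a tanh RNN cell forward and backward over the input, then
# concatenate each token's two hidden states into its annotation h_j.
rng = np.random.default_rng(1)
d_in, d_h = 4, 8
Wf = (0.1 * rng.standard_normal((d_h, d_in)), 0.1 * rng.standard_normal((d_h, d_h)))
Wb = (0.1 * rng.standard_normal((d_h, d_in)), 0.1 * rng.standard_normal((d_h, d_h)))

def run_rnn(tokens, W_x, W_h):
    h, states = np.zeros(d_h), []
    for x in tokens:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return states

def annotations(tokens):
    fwd = run_rnn(tokens, *Wf)              # forward states, in order
    bwd = run_rnn(tokens[::-1], *Wb)[::-1]  # backward states, re-aligned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

hs = annotations([rng.standard_normal(d_in) for _ in range(5)])
```

Each annotation thus summarizes the sentence around its token from both directions, which is what lets the decoder attend to individual source positions.<br />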
<br />
== Alignment ==<br />
<br />
An alignment model (in the form of a neural network) is used to measure how well each annotation <math>h_j</math> of the input sentence corresponds to the current state of constructing the translated sentence (represented by the vector <math>s_{i-1}</math>, the hidden state of the RNN that generates the tokens of the output sentence <math>y</math>). This is stored as the energy score <math>e_{ij} = a(s_{i-1}, h_j)</math>. <br />
<br />
The energy scores from the alignment process are used to assign weights <math>\alpha_{ij}</math> to the annotations, effectively trying to determine which of the words in the input is most likely to correspond to the next word that needs to be translated in the current stage of the output sequence:<br />
<br />
<math>\alpha_{ij} = \frac{\exp(e_{ij})}{\Sigma_k \exp(e_{ik})}</math><br />
<br />
The weights are then applied to the annotations to obtain the current context vector input: <br />
<br />
<math>c_i = \Sigma_j \alpha_{ij}h_j</math><br />
<br />
Note that this is where we see one major difference between the proposed method and the previous ones: the context vector, or representation of the input sentence, is not one fixed-length static vector <math>c</math>; rather, every time we translate a new word, a new representation vector <math>c_i</math> is produced. This vector depends on the source words most relevant to the current state of the translation (hence it is automatically aligning) and gives the input sentence a variable-length representation (since a new context vector <math>c_i</math> is computed for each target word).<br />
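The alignment and context-vector computation for one decoding step can be sketched as follows. The scoring form <math>e_{ij} = v_a^T \tanh(W_a s_{i-1} + U_a h_j)</math> follows the paper's appendix; the sizes and weights here are made up.<br />

```python
import numpy as np

# One decoding step i: energies e_ij, softmax weights alpha_ij, and the
# context vector c_i, given previous decoder state s_{i-1} and source
# annotations h_1..h_T. All dimensions are arbitrary illustrations.
rng = np.random.default_rng(2)
d_s, d_h, d_a, T = 6, 16, 5, 4
W_a = rng.standard_normal((d_a, d_s))
U_a = rng.standard_normal((d_a, d_h))
v_a = rng.standard_normal(d_a)

def context_vector(s_prev, annots):
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in annots])
    alpha = np.exp(e - e.max())   # softmax over source positions,
    alpha /= alpha.sum()          # shifted by max(e) for numerical stability
    c_i = sum(a * h_j for a, h_j in zip(alpha, annots))
    return alpha, c_i

alpha, c_i = context_vector(rng.standard_normal(d_s),
                            [rng.standard_normal(d_h) for _ in range(T)])
```

The weights <math>\alpha_{ij}</math> always sum to one, so <math>c_i</math> is a convex combination of the annotations, concentrated on the source words most relevant to the current target word.<br />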
<br />
== Decoding ==<br />
<br />
The decoding is done by using an RNN to model a probability distribution on the conditional probabilities <br />
<br />
<math>P(y_i | y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)</math><br />
<br />
where <math>s_i</math> is the hidden state of the decoder RNN at time step <math>i</math>, and <math>c_i</math> is the current context vector representation discussed above under Alignment.<br />
<br />
Once the encoding and alignment are done, the decoding step is fairly straightforward and corresponds with the typical approach of neural network translation systems, although the context vector representation is now different at each step of the translation.<br />
<br />
== Experiment Settings == <br />
The ACL WMT '14 English-to-French dataset was used to assess the performance of Bahdanau et al. (2015)'s <ref>Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).</ref> RNNsearch and the RNN Encoder-Decoder proposed by Cho et al. (2014) <ref>Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a).<br />
Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014).</ref>. <br />
<br />
The WMT '14 dataset contains the following corpora, totaling 850M words:<br />
* Europarl (61M words)<br />
* News Commentary (5.5M words)<br />
* UN (421M words) <br />
* Crawled corpora (90M and 272.5M words)<br />
<br />
This was reduced to 348M words using the data selection method described by Axelrod et al. (2011)<ref>Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.</ref>.<br />
<br />
Both models were trained in the same manner: minibatch stochastic gradient descent (SGD) with a minibatch size of 80, using AdaDelta to adapt the learning rate. Once a model has finished training, beam search is used to decode the computed probability distribution to obtain a translation output.<br />
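The beam-search decoding mentioned above can be sketched in a model-agnostic way (the toy transition distribution below is made up; a real decoder would score tokens with the trained RNN instead):<br />

```python
import math

# Toy beam search: keep the k best partial hypotheses, expand each by
# every candidate token, and retain the k highest-scoring continuations.
def beam_search(step_logprobs, beam_size, max_len, eos="</s>"):
    """step_logprobs(prefix) -> {token: log P(token | prefix)}."""
    beams = [([], 0.0)]                       # (tokens so far, total log prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:  # finished hypotheses carry over
                candidates.append((tokens, score))
                continue
            for tok, lp in step_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(t and t[-1] == eos for t, _ in beams):
            break
    return beams[0][0]

def toy_model(prefix):                        # made-up next-token distribution
    if not prefix:
        return {"a": math.log(0.4), "b": math.log(0.6)}
    return {"a": math.log(0.1), "b": math.log(0.2), "</s>": math.log(0.7)}

best = beam_search(toy_model, beam_size=2, max_len=5)  # -> ["b", "</s>"]
```

Because scores are summed log probabilities, the search prefers the globally most probable sequence rather than the greedy word-by-word choice.<br />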
<br />
= Results =<br />
<br />
The authors performed experiments using the proposed model, called "RNNsearch", in comparison with the previous kind of model, referred to as "RNNencdec". Both models were trained on the same English-to-French datasets, one containing sentences of up to 30 words and the other containing sentences of at most 50. They used a shortlist of the 30,000 most frequent words in each language to train their models; any word not in the shortlist is mapped to a special token representing an unknown word.<br />
<br />
Quantitatively, the RNNsearch scores exceed RNNencdec's by a clear margin. The difference is particularly strong on longer sentences, which the authors note to be a problem area for RNNencdec: information gets lost when trying to "squash" long sentences into fixed-length vector representations.<br />
<br />
The following graph, provided in the paper, shows the performance of RNNsearch compared with RNNencdec, based on the BLEU scores for evaluating machine translation.<br />
<br />
[[File:RNNsearch_Graph.jpg]]<br />
<br />
Qualitatively, the RNNsearch method does a good job of aligning words in the translation process, even when they need to be rearranged in the translated sentence. Long sentences are also handled very well: while RNNencdec is shown to typically lose meaning and effectiveness after a certain number of words into the sentence, RNNsearch seems robust and reliable even for unusually long sentences.<br />
<br />
= Conclusion and Comments =<br />
<br />
Overall, the algorithm proposed by the authors gives a new and seemingly useful approach towards machine translation, particularly for translating long sentences.<br />
<br />
The performance appears to be good, but it would be interesting to see if it can be maintained when translating between languages that are not as closely aligned naturally as English and French usually are. The authors briefly refer to other languages (such as German) but do not provide any experiments or detailed comments to describe how the algorithm would perform in such cases. <br />
<br />
It is also interesting to note that, while the performance was always shown to be better for RNNsearch than for the older RNNencdec model, the former also includes more hidden units overall than the latter. RNNencdec has 1000 hidden units for each of its encoding and decoding RNNs, giving a total of 2000; meanwhile, RNNsearch has 1000 hidden units for each of the forward and backward RNNs in encoding, as well as 1000 more for the decoding RNN, giving a total of 3000. This is perhaps a worthy point to take into consideration when judging the relative performance of the two models objectively.<br />
<br />
Compared to some other algorithms, the performance of the proposed algorithm on rare words, even in English-to-French translation, is not good enough. For long sentences with a large number of rare words, an algorithm that uses a deep LSTM to encode the input sequence and a separate deep LSTM to output the translation is more accurate, with a larger BLEU score.<ref>Luong M-T, Sutskever I, Le Q V, Vinyals O, Zaremba W (2015). Addressing the Rare Word Problem in Neural Machine Translation.</ref> <br />
<br />
Another way to explain the performance gains of RNNsearch over RNNencdec is RNNsearch's use of a bidirectional RNN (BiRNN) as its encoder. As explained by Schuster and Paliwal (1997) <ref>Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45 (11), 2673–2681.</ref>, compared to a traditional RNN, which only uses past data, a BiRNN considers both past and future contexts.<br />
<br />
One of the main drawbacks of the method is that, since the complexity of training increases with the number of target words, the number of target words must be limited (30,000-80,000). Since most languages' vocabularies are much larger than this, there may be at least a few words in a sentence that are not covered by the shortlist (especially for languages with rich vocabularies).<br />
<br />
=Reference= <br />
<references /></div>
<hr />
<div>= Introduction =<br />
<br />
In this paper Bahdanau et al (2015) presents a new way of using neural networks to perform machine translation. Rather than using the typical RNN encoder-decoder model with fixed-length intermediate vectors, they proposed a method that uses a joint learning process for both alignment and translation, and does not restrict intermediate encoded vectors to any specific fixed length. The result is a translation method that is comparable in performance to phrase-based systems (the state-of-the-art effective models that do not use a neural network approach), additionally it has been found the proposed method is more effective compared to other neural network models when applied to long sentences.<br />
<br />
In this paper, for the activation function of an RNN, the gated hidden unit is used which is similar to a long short-term memory (LSTM), but is able to better maintain contextual information from early until late in a sentence.Additionally, in the introduced method, the encoder assigns a context-dependent vector, or annotation, to every source word. The decoder then selectively combines the most relevant annotations to generate each target word; this implements a mechanism of attention in the decoder.<br />
<br />
= Previous methods =<br />
<br />
In order to better appreciate the value of this paper's contribution, it is important to understand how earlier techniques approached the problem of machine translation using neural networks.<br />
<br />
In machine translation, the problem at hand is to identify the target sentence <math>y</math> (in natural language <math>B</math>) that is the most likely corresponding translation to the source sentence <math>x</math> (in natural language <math>A</math>). The authors compactly summarize this problem using the formula <math> \arg\max_{y} P(y|x)</math>.<br />
<br />
Recent Neural Network approaches proposed by researchers such as Kalchbrenner and Blunsom<ref><br />
Kalchbrenner N, Blunsom P. Recurrent Continuous Translation Models[C]//EMNLP. 2013: 1700-1709.<br />
</ref>, Cho et al.<ref><br />
Cho K, van Merriënboer B, Bahdanau D, et al. On the properties of neural machine translation: Encoder-decoder approaches[J]. arXiv preprint arXiv:1409.1259, 2014.<br />
</ref>, Sutvesker et al.<ref><br />
Sutskever I, Vinyals O, Le Q V V. Sequence to sequence learning with neural networks[C]//Advances in neural information processing systems. 2014: 3104-3112.<br />
</ref> has built a neural machine translation to directly learn the conditional probability distribution between input <math>x</math> and output <math>y</math>. Experiments at current show that neural machine translation or extension of existing translation systems using RNNs perform better compared to state of the art systems.<br />
<br />
<br />
== Encoding ==<br />
<br />
Typically, the encoding step iterates through the input vectors in the representation of source sentence <math>x</math> and updates a hidden state with each new token in the input: <math>h_t = f(x_t, h_{t-1})</math>, for some nonlinear function <math>f</math>. After the entire input is read, the resulting fixed-length representation of the entire input sentence <math>x</math> is given by a nonlinear function <math>q</math> of all of the hidden states: <math>c = q(\{h_1, \ldots, h_{T_x}\})</math>. Different methods would use different nonlinear functions and different neural networks, but the essence of the approach is common to all.<br />
<br />
== Decoding == <br />
<br />
Decoding the fixed-length representation <math>c</math> of <math>x</math> is done by predicting one token of the target sentence <math>y</math> at a time, using the knowledge of all previously predicted words so far. The decoder defines a probability distribution over the possible sentences using a product of conditional probabilities <math>P(y) = \Pi_t P(y_t|\{y_1, \ldots, y_{t-1}\},c)</math>. <br />
<br />
In the neural network approach, the conditional probability of the next output term given the previous ones <math>P(y_t | \{y_1, \ldots, y_{t-1}\},c)</math> is given by the evaluation of a nonlinear function <math>g(y_{t-1}, s_t, c)</math>, where <math>s_t</math> is the hidden state of the RNN.<br />
<br />
= The proposed method =<br />
<br />
The method proposed here is different from the traditional approach because it bypasses the fixed-length context vector <math>c</math> altogether, and instead aligns the tokens of the translated sentence <math>y</math> directly with the corresponding tokens of source sentence <math>x</math> as it decides which parts might be most relevant. To accommodate this, a different neural network structure needs to be set up.<br />
<br />
== Encoding ==<br />
<br />
The proposed model does not use an ordinary recursive neural network to encode the target sentence <math>x</math>, but instead uses a bidirectional recursive neural network (BiRNN): this is a model that consists of both a forward and backward RNN, where the forward RNN takes the input tokens of <math>x</math> in the correct order when computing hidden states, and the backward RNN takes the tokens in reverse. Thus each token of <math>x</math> is associated with two hidden states, corresponding to the states it produces in the two RNNs. The annotation vector <math>h_j</math> of the token <math>x_j</math> in <math>x</math> is given by the concatenation of these two hidden states vectors.<br />
<br />
== Aligment ==<br />
<br />
An alignment model (in the form of a neural network) is used to measure how well each annotation <math>h_j</math> of the input sentence corresponds to the current state of constructing the translated sentence (represented by the vector <math>s_{i-1}</math>, the hidden state of the RNN that identifies the tokens in the output sentence <math>y</math>. This is stored as the energy score <math>e_{ij} = a(s_{i-1}, h_j)</math>. <br />
<br />
The energy scores from the alignment process are used to assign weights <math>\alpha_{ij}</math> to the annotations, effectively trying to determine which of the words in the input is most likely to correspond to the next word that needs to be translated in the current stage of the output sequence:<br />
<br />
<math>\alpha_{ij} = \frac{\exp(e_{ij})}{\Sigma_k \exp(e_{ik})}</math><br />
<br />
The weights are then applied to the annotations to obtain the current context vector input: <br />
<br />
<math>c_i = \Sigma_j \alpha_{ij}h_j</math><br />
<br />
Note that this is where we see one major difference between the proposed method and the previous ones: The context vector, or the representation of the input sentence, is not one fixed-length static vector <math>c</math>; rather, every time we translate a new word in the sentence, a new representation vector <math>c_i</math> is produced. This vector depends on the most relevant words in the source sentence to the current state in the translation (hence it is automatically aligning) and allows the input sentence to have a variable length representation (since each annotation in the input representation produces a new context vector <math>c_i</math>).<br />
<br />
== Decoding ==<br />
<br />
The decoding is done by using an RNN to model a probability distribution on the conditional probabilities <br />
<br />
<math>P(y_i | y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)</math><br />
<br />
where here, <math>s_i</math> is the RNN hidden state at the previous time step, and <math>c_i</math> is the current context vector representation as discussed above under Alignment.<br />
<br />
Once the encoding and alignment are done, the decoding step is fairly straightforward and corresponds with the typical approach of neural network translation systems, although the context vector representation is now different at each step of the translation.<br />
<br />
== Experiment Settings == <br />
The ACL WMT '14 dataset containing English to French translation were used to assess the performance of the Bahdanau et al(2015)'s <ref>Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).</ref> RNNSearch and RNN Encoder-Decoder proposed by Cho et al (2014) <ref>Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a).<br />
Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014).</ref>. <br />
<br />
The WMT '14 dataset actually contains the following corpora, totaling (850M words):<br />
* Europarl (61M words)<br />
* News Commentary (5.5M words)<br />
* UN (421M words) <br />
* Crawled corpora (90M and 272.5 words)<br />
<br />
This was reduced to 348M using data selection method described by Axelord, et al (2011)<ref>Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.</ref>.<br />
<br />
Both models were trained in the same manner, by using minibatch stochastic gradient descent (SGD) with size 80 and AdaDelta. Once the model has finished training, beam search is used to decode the computed probability distribution to obtain a translation output.<br />
<br />
= Results =<br />
<br />
The authors performed some experiments using the proposed model of machine translation, calling it "RNNsearch", in comparison with the previous kind of model, referred to as "RNNencdec". Both models were trained on the same datasets for translating English to French, with one dataset containing sentences of length up to 30 words, and the other containing sentences with at most 50. They used a shortlist of 30,000 most frequent words in each language to train their models and any word not included in the shortlist is mapped to a special token representing unknown word.<br />
<br />
Quantitatively, the RNNsearch scores exceed RNNencdec by a clear margin. The distinction is particularly strong in longer sentences, which the authors note to be a problem area for RNNencdec -- information gets lost when trying to "squash" long sentences into fixed-length vector representations.<br />
<br />
The following graph, provided in the paper, shows the performance of RNNsearch compared with RNNencdec, based on the BLEU scores for evaluating machine translation.<br />
<br />
[[File:RNNsearch_Graph.jpg]]<br />
<br />
Qualitatively, the RNNsearch method does a good job of aligning words in the translation process, even when they need to be rearranged in the translated sentence. Long sentences are also handled very well: while RNNencdec is shown to typically lose meaning and effectiveness after a certain number of words into the sentence, RNNsearch seems robust and reliable even for unusually long sentences.<br />
<br />
= Conclusion and Comments =<br />
<br />
Overall, the algorithm proposed by the authors gives a new and seemingly useful approach towards machine translation, particularly for translating long sentences.<br />
<br />
The performance appears to be good, but it would be interesting to see if it can be maintained when translating between languages that are not as closely aligned naturally as English and French usually are. The authors briefly refer to other languages (such as German) but do not provide any experiments or detailed comments to describe how the algorithm would perform in such cases. <br />
<br />
It is also interesting to note that, while the performance was always shown to be better for RNNsearch than for the older RNNencdec model, the former also includes more hidden units overall in its models than the latter. RNNencdec was mentioned as having 1000 hidden units for each of its encoding and decoding RNNs, giving a total of 2000; meanwhile, RNNsearch had 1000 hidden units for each the forward and backward RNNs in encoding, as well as 1000 more for the decoding RNN, giving a total of 3000. This is perhaps a worthy point to take into consideration when judging the relative performance of the two models objectively.<br />
<br />
Compare to some other algorithms, the performance of proposed algorithm for rare words, even in English to French translation is not good enough. For long sentences with large number of rare words the algorithm which uses a deep LSTM to encode the input sequence and a separate deep LSTM to output the translation works more accurate with larger BLEU score. <ref> Sutskever I, Le Q, Vinyals O, Zaremba W (1997).Addressing the Rare Word Problem in<br />
Neural Machine Translation, </ref>,. <br />
<br />
Another approach to explaining the performance gains of RNNsearch over RNNencdec is due to RNNsearch's usage of the Bi-Directional RNN (BiRNN) as both encoder and decoder. As explained by Schuster and Paliwal (1997) <ref>Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45 (11), 2673–2681</ref>, compared to traditional RNN which only explores past data, BiRNN considers both past and future contexts.<br />
<br />
One of the main drawbacks of the method is that, since the complexity of training increases as the number of target words increases, the number of target worlds must be limited (30000-80000). Since most languages are much larger than this, there may be at least a few words in sentences that are not covered by the shortlist (especially for languages with a rich set of words).<br />
<br />
=Reference= <br />
</reference></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=neural_Machine_Translation:_Jointly_Learning_to_Align_and_Translate&diff=26185neural Machine Translation: Jointly Learning to Align and Translate2015-11-13T02:16:18Z<p>Rqiao: /* Previous methods */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper Bahdanau et al (2015) presents a new way of using neural networks to perform machine translation. Rather than using the typical RNN encoder-decoder model with fixed-length intermediate vectors, they proposed a method that uses a joint learning process for both alignment and translation, and does not restrict intermediate encoded vectors to any specific fixed length. The result is a translation method that is comparable in performance to phrase-based systems (the state-of-the-art effective models that do not use a neural network approach), additionally it has been found the proposed method is more effective compared to other neural network models when applied to long sentences.<br />
<br />
In this paper, for the activation function of an RNN, the gated hidden unit is used which is similar to a long short-term memory (LSTM), but is able to better maintain contextual information from early until late in a sentence.Additionally, in the introduced method, the encoder assigns a context-dependent vector, or annotation, to every source word. The decoder then selectively combines the most relevant annotations to generate each target word; this implements a mechanism of attention in the decoder.<br />
<br />
= Previous methods =<br />
<br />
In order to better appreciate the value of this paper's contribution, it is important to understand how earlier techniques approached the problem of machine translation using neural networks.<br />
<br />
In machine translation, the problem at hand is to identify the target sentence <math>y</math> (in natural language <math>B</math>) that is the most likely corresponding translation to the source sentence <math>x</math> (in natural language <math>A</math>). The authors compactly summarize this problem using the formula <math> \arg\max_{y} P(y|x)</math>.<br />
<br />
Recent Neural Network approaches proposed by researchers such as Kalchbrenner and Blunsom<ref><br />
Kalchbrenner N, Blunsom P. Recurrent Continuous Translation Models[C]//EMNLP. 2013: 1700-1709.<br />
</ref>, Cho et al.<ref><br />
Cho K, van Merriënboer B, Bahdanau D, et al. On the properties of neural machine translation: Encoder-decoder approaches[J]. arXiv preprint arXiv:1409.1259, 2014.<br />
</ref>, Sutvesker et al.<ref><br />
Sutskever I, Vinyals O, Le Q V V. Sequence to sequence learning with neural networks[C]//Advances in neural information processing systems. 2014: 3104-3112.<br />
</ref> has built a neural machine translation to directly learn the conditional probability distribution between input <math>x</math> and output <math>y</math>. Experiments at current show that neural machine translation or extension of existing translation systems using RNNs perform better compared to state of the art systems.<br />
<br />
<br />
== Encoding ==<br />
<br />
Typically, the encoding step iterates through the input vectors in the representation of source sentence <math>x</math> and updates a hidden state with each new token in the input: <math>h_t = f(x_t, h_{t-1})</math>, for some nonlinear function <math>f</math>. After the entire input is read, the resulting fixed-length representation of the entire input sentence <math>x</math> is given by a nonlinear function <math>q</math> of all of the hidden states: <math>c = q(\{h_1, \ldots, h_{T_x}\})</math>. Different methods would use different nonlinear functions and different neural networks, but the essence of the approach is common to all.<br />
<br />
== Decoding == <br />
<br />
Decoding the fixed-length representation <math>c</math> of <math>x</math> is done by predicting one token of the target sentence <math>y</math> at a time, using the knowledge of all previously predicted words so far. The decoder defines a probability distribution over the possible sentences using a product of conditional probabilities <math>P(y) = \Pi_t P(y_t|\{y_1, \ldots, y_{t-1}\},c)</math>. <br />
<br />
In the neural network approach, the conditional probability of the next output term given the previous ones <math>P(y_t | \{y_1, \ldots, y_{t-1}\},c)</math> is given by the evaluation of a nonlinear function <math>g(y_{t-1}, s_t, c)</math>, where <math>s_t</math> is the hidden state of the RNN.<br />
<br />
= The proposed method =<br />
<br />
The method proposed here is different from the traditional approach because it bypasses the fixed-length context vector <math>c</math> altogether, and instead aligns the tokens of the translated sentence <math>y</math> directly with the corresponding tokens of source sentence <math>x</math> as it decides which parts might be most relevant. To accommodate this, a different neural network structure needs to be set up.<br />
<br />
== Encoding ==<br />
<br />
The proposed model does not use an ordinary recursive neural network to encode the target sentence <math>x</math>, but instead uses a bidirectional recursive neural network (BiRNN): this is a model that consists of both a forward and backward RNN, where the forward RNN takes the input tokens of <math>x</math> in the correct order when computing hidden states, and the backward RNN takes the tokens in reverse. Thus each token of <math>x</math> is associated with two hidden states, corresponding to the states it produces in the two RNNs. The annotation vector <math>h_j</math> of the token <math>x_j</math> in <math>x</math> is given by the concatenation of these two hidden states vectors.<br />
<br />
== Alignment ==<br />
<br />
An alignment model (in the form of a neural network) is used to measure how well each annotation <math>h_j</math> of the input sentence corresponds to the current state of constructing the translated sentence (represented by the vector <math>s_{i-1}</math>, the hidden state of the RNN that generates the tokens of the output sentence <math>y</math>). This is stored as the energy score <math>e_{ij} = a(s_{i-1}, h_j)</math>. <br />
<br />
The energy scores from the alignment process are used to assign weights <math>\alpha_{ij}</math> to the annotations, effectively trying to determine which of the words in the input is most likely to correspond to the next word that needs to be translated in the current stage of the output sequence:<br />
<br />
<math>\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}</math><br />
<br />
The weights are then applied to the annotations to obtain the current context vector input: <br />
<br />
<math>c_i = \sum_j \alpha_{ij}h_j</math><br />
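These two equations can be sketched directly; the alignment model <math>a</math> below is a toy bilinear scorer with a random projection, not the small feedforward network used in the paper:<br />

```python
import numpy as np

def attention_context(s_prev, annotations, a):
    """Compute alpha_ij = softmax_j(e_ij) and the context c_i = sum_j alpha_ij h_j."""
    e = np.array([a(s_prev, h) for h in annotations])   # energies e_ij
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                # softmax over source positions
    c = sum(w * h for w, h in zip(alpha, annotations))  # weighted sum of annotations
    return alpha, c

rng = np.random.default_rng(3)
d_s, d_h, T = 3, 4, 5
U = rng.normal(size=(d_s, d_h))          # toy bilinear alignment model (an assumption)
a = lambda s, h: float(s @ U @ h)
anns = [rng.normal(size=d_h) for _ in range(T)]
alpha, c_i = attention_context(rng.normal(size=d_s), anns, a)
```

The weights <math>\alpha_{ij}</math> form a probability distribution over source positions, which is what makes them interpretable as a soft alignment.<br />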
<br />
Note that this is where we see one major difference between the proposed method and the previous ones: the context vector, or the representation of the input sentence, is not one fixed-length static vector <math>c</math>; rather, every time we translate a new word in the sentence, a new representation vector <math>c_i</math> is produced. This vector depends on the words in the source sentence most relevant to the current state of the translation (hence it is automatically aligning), and it allows the input sentence to have a variable-length representation (since a new context vector <math>c_i</math> is produced at each decoding step).<br />
<br />
== Decoding ==<br />
<br />
The decoding is done by using an RNN to model a probability distribution on the conditional probabilities <br />
<br />
<math>P(y_i | y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)</math><br />
<br />
where here, <math>s_i</math> is the hidden state of the decoder RNN at time step <math>i</math>, and <math>c_i</math> is the current context vector representation as discussed above under Alignment.<br />
<br />
Once the encoding and alignment are done, the decoding step is fairly straightforward and corresponds with the typical approach of neural network translation systems, although the context vector representation is now different at each step of the translation.<br />
<br />
== Experiment Settings == <br />
The ACL WMT '14 English-to-French translation task was used to assess the performance of Bahdanau et al. (2015)'s <ref>Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).</ref> RNNsearch against the RNN Encoder-Decoder proposed by Cho et al. (2014) <ref>Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).</ref>. <br />
<br />
The WMT '14 dataset actually contains the following corpora, totaling 850M words:<br />
* Europarl (61M words)<br />
* News Commentary (5.5M words)<br />
* UN (421M words) <br />
* Crawled corpora (90M and 272.5M words)<br />
<br />
This was reduced to 348M words using the data selection method described by Axelrod et al. (2011)<ref>Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.</ref>.<br />
<br />
Both models were trained in the same manner, using minibatch stochastic gradient descent (SGD) with minibatch size 80 together with Adadelta. Once a model finished training, beam search was used to decode the computed probability distribution into a translation output.<br />
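Beam search, the decoding procedure mentioned above, can be sketched generically: keep the few best partial outputs at each step and extend them until an end-of-sentence symbol. The conditional probability table below is a made-up toy model, not data from the experiments:<br />

```python
import math

def beam_search(cond_logprob, vocab, eos, width, max_len):
    """Keep the `width` best partial outputs; extend until EOS or max_len."""
    beams, done = [((), 0.0)], []
    for _ in range(max_len):
        cand = []
        for seq, lp in beams:
            for w in vocab:
                cand.append((seq + (w,), lp + cond_logprob(seq, w)))
        cand.sort(key=lambda t: t[1], reverse=True)
        beams = []
        for seq, lp in cand[:width]:
            (done if seq[-1] == eos else beams).append((seq, lp))
        if not beams:
            break
    return max(done + beams, key=lambda t: t[1])

# Toy conditionals that prefer the output "a b </s>" (hypothetical probabilities)
table = {(): {"a": 0.7, "b": 0.2, "</s>": 0.1},
         ("a",): {"a": 0.1, "b": 0.6, "</s>": 0.3},
         ("a", "b"): {"a": 0.1, "b": 0.1, "</s>": 0.8}}
default = {"a": 0.1, "b": 0.1, "</s>": 0.8}
lp = lambda seq, w: math.log(table.get(seq, default)[w])
best, score = beam_search(lp, ["a", "b", "</s>"], "</s>", width=2, max_len=4)
```

With a width of 1 this reduces to greedy decoding; a larger width explores more hypotheses at higher cost.<br />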
<br />
= Results =<br />
<br />
The authors performed some experiments using the proposed model of machine translation, calling it "RNNsearch", in comparison with the previous kind of model, referred to as "RNNencdec". Both models were trained on the same datasets for translating English to French, with one dataset containing sentences of length up to 30 words, and the other containing sentences of at most 50 words. They used a shortlist of the 30,000 most frequent words in each language to train their models, and any word not included in the shortlist was mapped to a special token representing an unknown word.<br />
<br />
Quantitatively, the RNNsearch scores exceed RNNencdec by a clear margin. The distinction is particularly strong in longer sentences, which the authors note to be a problem area for RNNencdec -- information gets lost when trying to "squash" long sentences into fixed-length vector representations.<br />
<br />
The following graph, provided in the paper, shows the performance of RNNsearch compared with RNNencdec, based on the BLEU scores for evaluating machine translation.<br />
<br />
[[File:RNNsearch_Graph.jpg]]<br />
<br />
Qualitatively, the RNNsearch method does a good job of aligning words in the translation process, even when they need to be rearranged in the translated sentence. Long sentences are also handled very well: while RNNencdec is shown to typically lose meaning and effectiveness after a certain number of words into the sentence, RNNsearch seems robust and reliable even for unusually long sentences.<br />
<br />
= Conclusion and Comments =<br />
<br />
Overall, the algorithm proposed by the authors gives a new and seemingly useful approach towards machine translation, particularly for translating long sentences.<br />
<br />
The performance appears to be good, but it would be interesting to see if it can be maintained when translating between languages that are not as closely aligned naturally as English and French usually are. The authors briefly refer to other languages (such as German) but do not provide any experiments or detailed comments to describe how the algorithm would perform in such cases. <br />
<br />
It is also interesting to note that, while the performance was always shown to be better for RNNsearch than for the older RNNencdec model, the former also includes more hidden units overall in its models than the latter. RNNencdec was mentioned as having 1000 hidden units for each of its encoding and decoding RNNs, giving a total of 2000; meanwhile, RNNsearch had 1000 hidden units for each of the forward and backward RNNs in encoding, as well as 1000 more for the decoding RNN, giving a total of 3000. This is perhaps a worthy point to take into consideration when judging the relative performance of the two models objectively.<br />
<br />
Compared to some other algorithms, the performance of the proposed algorithm on rare words, even in English-to-French translation, is not good enough. For long sentences with a large number of rare words, an algorithm that uses a deep LSTM to encode the input sequence and a separate deep LSTM to output the translation is more accurate, achieving a larger BLEU score. <ref>Luong, M.-T., Sutskever, I., Le, Q. V., Vinyals, O., and Zaremba, W. (2015). Addressing the rare word problem in neural machine translation. In Proceedings of ACL.</ref> <br />
<br />
Another approach to explaining the performance gains of RNNsearch over RNNencdec is RNNsearch's use of the Bidirectional RNN (BiRNN) as its encoder. As explained by Schuster and Paliwal (1997) <ref>Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45 (11), 2673–2681</ref>, compared to a traditional RNN, which only exploits past context, a BiRNN considers both past and future contexts.<br />
<br />
One of the main drawbacks of the method is that, since the complexity of training increases as the number of target words increases, the number of target words must be limited (30,000-80,000). Since most languages have vocabularies much larger than this, there may be at least a few words in a sentence that are not covered by the shortlist (especially for languages with a rich set of words).</div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=graves_et_al.,_Speech_recognition_with_deep_recurrent_neural_networks&diff=26150graves et al., Speech recognition with deep recurrent neural networks2015-11-12T17:20:45Z<p>Rqiao: /* Further works */</p>
<hr />
<div>= Overview =<br />
<br />
This document is a summary of the paper ''Speech recognition with deep recurrent neural networks'' by A. Graves, A.-R. Mohamed, and G. Hinton, which appeared in the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The first and third authors are Artificial Neural Network (ANN) researchers, while Mohamed works in the field of automatic speech recognition.<br />
<br />
The paper presents the application of bidirectional multilayer Long Short-term Memory (LSTM) ANNs with 1–5 layers to phoneme recognition on the TIMIT acoustic phoneme corpus, which is the standard benchmark in the field of acoustic recognition, extending the previous work by Mohamed and Hinton on this topic using Deep Belief Networks. The TIMIT corpus contains audio recordings of 6300 sentences spoken by 630 (American) English speakers from 8 regions with distinct dialects, where each recording has accompanying manually labelled transcriptions of the phonemes in the audio clips alongside timestamp information. The empirical classification accuracies reported in the literature before the publication of this paper are shown in the timeline below (note that in this figure, the accuracy metric is 100% - PER, where PER is the phoneme classification error rate).<br />
<br />
The deep LSTM networks presented with 3 or more layers obtain phoneme classification error rates of 19.6% or less, with one model obtaining 17.7%, which was the best result reported in the literature at the time, outperforming the previous record of 20.7% achieved by Mohamed et al. Furthermore, the error rate decreases monotonically with LSTM network depth for 1–5 layers. While the bidirectional LSTM model performs well on the TIMIT corpus, any potential advantage of bidirectional over unidirectional LSTM network models cannot be determined from this paper, since the performance comparison is across different numbers of iterations taken in the optimization algorithm used to train the models, and multiple trials for statistical validity were not performed.<br />
<br />
<br />
[[File:timit.png | frame | center |Timeline of percentage phoneme recognition accuracy achieved on the core TIMIT corpus, from Lopes and Perdigao, 2011. ]]<br />
<br />
== Motivation ==<br />
Neural networks have been trained for speech recognition problems, however usually in combination with hidden Markov Models. The authors in this paper argue that, since speech is an inherently dynamic process, RNNs should be the ideal choice for such a problem. There have been attempts to train RNNs for speech recognition <ref>A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, [http://machinelearning.wustl.edu/mlpapers/paper_files/icml2006_GravesFGS06.pdf “Connectionist Temporal Classification: Labelling Un-segmented Sequence Data with Recurrent Neural Networks” ] in ICML, Pittsburgh, USA, 2006.</ref> <ref> A. Graves, [http://download.springer.com/static/pdf/292/bok%253A978-3-642-24797-2.pdf?originUrl=http%3A%2F%2Flink.springer.com%2Fbook%2F10.1007%2F978-3-642-24797-2&token2=exp=1447349790~acl=%2Fstatic%2Fpdf%2F292%2Fbok%25253A978-3-642-24797-2.pdf%3ForiginUrl%3Dhttp%253A%252F%252Flink.springer.com%252Fbook%252F10.1007%252F978-3-642-24797-2*~hmac=06056ebc10e0f5e35b951ae6ab82de8b5c5bd9d8042084c07713a8946f561a7b Supervised sequence labelling with recurrent neural networks], vol. 385, Springer, 2012.</ref> <br />
<ref> A. Graves, [http://arxiv.org/pdf/1211.3711v1.pdf “Sequence transduction with recurrent neural networks”] in ICML Representation Learning Workshop, 2012.</ref> and RNNs with LSTM for recognizing cursive handwriting <ref> A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, [http://papers.nips.cc/paper/3213-unconstrained-on-line-handwriting-recognition-with-recurrent-neural-networks.pdf “Unconstrained Online Handwriting Recognition with Recurrent Neural Networks”] in NIPS, 2008.</ref> but neither has made an impact on speech recognition. The authors drew inspiration from Convolutional Neural Networks, where multiple layers are stacked on top of each other, and combined such deep (stacked) architectures with LSTM RNNs.<br />
<br />
However, instead of using a conventional RNN, which only considers previous context, a Bidirectional RNN <ref> M. Schuster and K. K. Paliwal, [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=650093 “Bidirectional Recurrent Neural Networks”] IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.</ref> was used to consider both forward and backward contexts. This is partly because the authors saw no reason not to exploit future contexts, since the speech utterances are transcribed all at once. Additionally, a BRNN has the added benefit of being able to consider the entire forward and backward context, not just some predefined window of forward and backward contexts.<br />
<br />
[[File:brnn.png|center|600px]]<br />
<br />
= Deep RNN models considered by Graves et al. =<br />
<br />
In this paper, Graves et al. use deep LSTM network models. We briefly review recurrent neural networks, which form the basis of the more complicated LSTM network that has composite <math>\mathcal{H}</math> functions instead of sigmoids and additional parameter vectors associated with the ''state'' of each neuron. Finally, a description of ''bidirectional'' ANNs is given, which is used throughout the numerical experiments.<br />
<br />
== Recurrent Neural Networks ==<br />
<br />
Recall that a standard 1-layer recurrent neural network (RNN) computes the hidden vector sequence <math>{\boldsymbol h} = ({{\mathbf{h}}}_1,\ldots,{{\mathbf{h}}}_T)</math> and output vector sequence <math>{{\boldsymbol {{\mathbf{y}}}}}= ({{\mathbf{y}}}_1,\ldots,{{\mathbf{y}}}_T)</math> from an input vector sequence <math>{{\boldsymbol {{\mathbf{x}}}}}= ({{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_T)</math> through the following equation where the index is from <math>t=1</math> to <math>T</math>:<br />
<br />
<math>{{\mathbf{h}}}_t = \begin{cases}<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{\mathbf{b_{h}}}}}\right) &\quad t = 1\\<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{{\mathbf{W}}}_{h h}}}{{\mathbf{h}}}_{t-1} + {{{\mathbf{b_{h}}}}}\right) &\quad \text{else}<br />
\end{cases}</math><br />
<br />
and<br />
<br />
<math>{{\mathbf{y}}}_t = {{{{\mathbf{W}}}_{h y}}}{{\mathbf{h}}}_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The <math>{{\mathbf{W}}}</math> terms are the parameter matrices with subscripts denoting the layer location (<span>e.g. </span><math>{{{{\mathbf{W}}}_{x h}}}</math> is the input-hidden weight matrix), and the offset <math>b</math> terms are bias vectors with appropriate subscripts (<span>e.g. </span><math>{{{\mathbf{b_{h}}}}}</math> is hidden bias vector). The function <math>{\mathcal{H}}</math> is an elementwise vector function with a range of <math>[0,1]</math> for each component in the hidden layer.<br />
<br />
This paper considers multilayer RNN architectures, with the same hidden layer function used for all <math>N</math> layers. In this model, the hidden vector in the <math>n</math>th layer, <math>{\boldsymbol h}^n</math>, is generated by the rule<br />
<br />
<math>{{\mathbf{h}}}^n_t = {\mathcal{H}}\left({{\mathbf{W}}}_{h^{n-1}h^{n}} {{\mathbf{h}}}^{n-1}_t +<br />
{{\mathbf{W}}}_{h^{n}h^{n}} {{\mathbf{h}}}^n_{t-1} + {{{\mathbf{b_{h}}}}}^n \right),</math><br />
<br />
where <math>{\boldsymbol h}^0 = {{\boldsymbol {{\mathbf{x}}}}}</math>. The final network output vector in the <math>t</math>th step of the output sequence, <math>{{\mathbf{y}}}_t</math>, is<br />
<br />
<math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{h^N y}} {{\mathbf{h}}}^N_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
This is pictured in the figure below for an arbitrary layer and time step.<br />
[[File:rnn_graves.png | frame | center |Fig 1. Schematic of a Recurrent Neural Network at an arbitrary layer and time step. ]]<br />
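The stacked recursion above can be sketched as follows, assuming a <math>\tanh</math> hidden-layer function <math>\mathcal{H}</math> and random weights (illustrative only):<br />

```python
import numpy as np

def deep_rnn(xs, layers, W_hy, b_y):
    """Stacked RNN: layer n consumes layer n-1's hidden sequence; the output reads layer N."""
    seq = xs                                # h^0 = x
    for W_in, W_rec, b in layers:           # h^n_t = H(W h^{n-1}_t + W h^n_{t-1} + b^n)
        h, out = np.zeros(W_rec.shape[0]), []
        for v in seq:
            h = np.tanh(W_in @ v + W_rec @ h + b)
            out.append(h)
        seq = out
    return [W_hy @ h + b_y for h in seq]    # y_t = W_{h^N y} h^N_t + b_y

rng = np.random.default_rng(4)
d_in, d_h, d_y, T, N = 5, 3, 2, 4, 3
layers = [(rng.normal(size=(d_h, d_in if n == 0 else d_h)),
           rng.normal(size=(d_h, d_h)), np.zeros(d_h)) for n in range(N)]
ys = deep_rnn([rng.normal(size=d_in) for _ in range(T)], layers,
              rng.normal(size=(d_y, d_h)), np.zeros(d_y))
```

Note that only the final layer's hidden sequence feeds the output transformation, matching the equations above.<br />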
<br />
== Long Short-term Memory Architecture ==<br />
<br />
Graves et al. consider a Long Short-Term Memory (LSTM) architecture from Gers et al. This model replaces <math>\mathcal{H}(\cdot)</math> by a composite function that incurs additional parameter matrices, and hence a higher-dimensional model. Each neuron in the network (<span>i.e. </span>a row of a parameter matrix <math>{{\mathbf{W}}}</math>) has an associated state vector <math>{{\mathbf{c}}}_t</math> at step <math>t</math>, which is a function of the previous <math>{{\mathbf{c}}}_{t-1}</math>, the input <math>{{\mathbf{x}}}_t</math> at step <math>t</math>, and the previous step’s hidden state <math>{{\mathbf{h}}}_{t-1}</math>, as<br />
<br />
<math>{{\mathbf{c}}}_t = {{\mathbf{f}}}_t \circ {{\mathbf{c}}}_{t-1} + {{\mathbf{i}}}_t \circ \tanh<br />
\left({{{\mathbf{W}}}_{x {{\mathbf{c}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{c}}}}} {{\mathbf{h}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{c}}}\right)</math><br />
<br />
where <math>\circ</math> denotes the Hadamard product (elementwise vector multiplication), and the vector <math>{{\mathbf{i}}}_t</math> denotes the so-called ''input gate'' vector of the cell, which is generated by the rule<br />
<br />
<math>{{\mathbf{i}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{i}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{i}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{i}}}\right),</math><br />
<br />
and <math>{{\mathbf{f}}}_t</math> is the ''forget gate'' vector, which is given by<br />
<br />
<math>{{\mathbf{f}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{f}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{f}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{f}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{f}}}\right)</math><br />
<br />
Each <math>{{\mathbf{W}}}</math> matrix and bias vector <math>{{\mathbf{b}}}</math> is a free parameter in the model and must be trained. Since <math>{{\mathbf{f}}}_t</math> multiplies the previous state <math>{{\mathbf{c}}}_{t-1}</math> in a Hadamard product with each element in the range <math>[0,1]</math>, it can be understood to reduce or dampen the effect of <math>{{\mathbf{c}}}_{t-1}</math> relative to the new input <math>{{\mathbf{i}}}_t</math>. The final hidden output state is then<br />
<br />
<math>{{\mathbf{h}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{o}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{o}}}}} {{\mathbf{h}}}_{t-1}<br />
+ {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{o}}}}} {{\mathbf{c}}}_{t} + {{\mathbf{b}}}_{{\mathbf{o}}}\right)\circ \tanh({{\mathbf{c}}}_t)</math><br />
<br />
In all of these equations, <math>\sigma</math> denotes the logistic sigmoid function. Note furthermore that <math>{{\mathbf{i}}}</math>, <math>{{\mathbf{f}}}</math>, <math>{{\mathbf{o}}}</math> and <math>{{\mathbf{c}}}</math> are all of the same dimension as the hidden vector <math>h</math>. In addition, the weight matrices from the cell to gate vectors (<span>e.g. </span><math>{{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}</math>) are ''diagonal'', such that each parameter matrix is merely a scaling matrix.<br />
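The composite equations above can be collected into a single step function; the random parameters below are illustrative, and the diagonal cell-to-gate matrices are represented as vectors acting elementwise:<br />

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with diagonal ("peephole") cell-to-gate weights, as in the paper."""
    i = sigmoid(p["Wxi"] @ x + p["Whi"] @ h_prev + p["wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x + p["Whf"] @ h_prev + p["wcf"] * c_prev + p["bf"])
    c = f * c_prev + i * np.tanh(p["Wxc"] @ x + p["Whc"] @ h_prev + p["bc"])
    o = sigmoid(p["Wxo"] @ x + p["Who"] @ h_prev + p["wco"] * c + p["bo"])
    return o * np.tanh(c), c            # (h_t, c_t)

rng = np.random.default_rng(5)
d_x, d_h = 4, 3
p = {k: rng.normal(size=(d_h, d_x)) for k in ["Wxi", "Wxf", "Wxc", "Wxo"]}
p.update({k: rng.normal(size=(d_h, d_h)) for k in ["Whi", "Whf", "Who", "Whc"]})
p.update({k: rng.normal(size=d_h) for k in ["wci", "wcf", "wco"]})  # diagonals as vectors
p.update({k: np.zeros(d_h) for k in ["bi", "bf", "bc", "bo"]})
h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), p)
```

Since the output gate and <math>\tanh({{\mathbf{c}}}_t)</math> both lie in <math>(-1,1)</math> componentwise, the hidden state is always bounded.<br />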
<br />
== Bidirectional RNNs ==<br />
<br />
A bidirectional RNN adds another layer of complexity by computing 2 hidden vectors per layer. Neglecting the <math>n</math> superscripts for the layer index, the ''forward'' hidden vector is determined through the conventional recursion as<br />
<br />
<math>{\overrightarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overrightarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}{\overrightarrow{{{\mathbf{h}}}}}}} {\overrightarrow{{{\mathbf{h}}}}}_{t-1} + {{\mathbf{b_{{\overrightarrow{{{\mathbf{h}}}}}}}}} \right),</math><br />
<br />
while the ''backward'' hidden state is determined recursively from the ''reversed'' sequence <math>({{\mathbf{x}}}_T,\ldots,{{\mathbf{x}}}_1)</math> as<br />
<br />
<math>{\overleftarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overleftarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}{\overleftarrow{{{\mathbf{h}}}}}}} {\overleftarrow{{{\mathbf{h}}}}}_{t+1} + {{\mathbf{b_{{\overleftarrow{{{\mathbf{h}}}}}}}}}\right).</math><br />
<br />
The final output for the single layer state is then an affine transformation of <math>{\overrightarrow{{{\mathbf{h}}}}}_t</math> and <math>{\overleftarrow{{{\mathbf{h}}}}}_t</math> as <math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}y}} {\overrightarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}y}} {\overleftarrow{{{\mathbf{h}}}}}_t<br />
+ {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The combination of LSTM with bidirectional dependence has been previously used by Graves and Schmidhuber, and is extended to multilayer networks in this paper. The motivation for this is to use dependencies on both prior and posterior vectors in the sequence to predict a given output at any time step. In other words, a forward and backward context is used.<br />
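A sketch of one full bidirectional layer, combining the forward and backward recursions with the affine output above (a <math>\tanh</math> unit stands in for <math>\mathcal{H}</math>, and all weights are random and illustrative):<br />

```python
import numpy as np

def birnn_layer(xs, fwd, bwd, W_fy, W_by, b_y):
    """y_t = W_fy * h_fwd_t + W_by * h_bwd_t + b_y, backward pass on the reversed input."""
    def run(seq, W_x, W_h, b):
        h, out = np.zeros(W_h.shape[0]), []
        for x in seq:
            h = np.tanh(W_x @ x + W_h @ h + b)
            out.append(h)
        return out
    hf = run(xs, *fwd)
    hb = run(xs[::-1], *bwd)[::-1]   # re-reverse so hb[t] aligns with x_t
    return [W_fy @ f + W_by @ b_ + b_y for f, b_ in zip(hf, hb)]

rng = np.random.default_rng(6)
d_in, d_h, d_y, T = 4, 3, 2, 5
make = lambda: (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
ys = birnn_layer([rng.normal(size=d_in) for _ in range(T)], make(), make(),
                 rng.normal(size=(d_y, d_h)), rng.normal(size=(d_y, d_h)), np.zeros(d_y))
```

Because the whole utterance is available before decoding, every output <math>{{\mathbf{y}}}_t</math> can depend on both past and future frames.<br />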
<br />
= Network Training for Phoneme Recognition =<br />
<br />
This section describes the phoneme classification experiments performed with the TIMIT corpus. An overview of the input timeseries audio data preprocessing into frequency domain vectors is given, and the optimization techniques are described.<br />
<br />
== Frequency Domain Processing ==<br />
<br />
Recall that for a real, periodic signal <math>{f(t)}</math>, the Fourier transform<br />
<br />
<math>{F(\omega)}= \int_{-\infty}^{\infty}e^{-i\omega t} {f(t)}dt</math><br />
<br />
can be represented for discrete samples <math>{f_0, f_1, \cdots<br />
f_{N-1}}</math> as<br />
<br />
<math>F_k\ = \sum_{n=0}^{N-1} f_n \cdot e^{-i \frac{2 \pi k n}{N}}</math>,<br />
<br />
where <math>{F_k}</math> are the discrete coefficients of the (amplitude) spectral distribution of the signal <math>f</math> in the frequency domain. This is a particularly powerful representation of audio data, since the modulation produced by the tongue and lips when shifting position while the larynx vibrates induces the frequency changes that make up a phonetic alphabet. A particular example is the spectrogram below. A spectrogram is a heat map representation of a matrix of data in which each pixel has intensity proportional to the magnitude of the matrix entry at that location. In this case, the matrix is a windowed Fourier transform of an audio signal spectrum such that the frequency of the audio signal as a function of time can be observed. This spectrogram shows the frequencies of my voice while singing the first bar of ''Hey Jude''; the bright pixels below 200 Hz show the base note, while the fainter lines at integer multiples of the base notes show resonant harmonics.<br />
<br />
[[File:spect.png | frame | center | Spectrogram of the first bar of Hey Jude, showing the frequency amplitude coefficients changing over time as the intensity of the pixels in the heat map.]]<br />
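A quick numerical check of this frequency-domain representation: the DFT of a sampled sinusoid peaks at the expected frequency bin. The sampling rate and tone below are toy values chosen so the peak lands exactly on a bin, not the TIMIT settings:<br />

```python
import numpy as np

# DFT of a 200 Hz sinusoid sampled at 1600 Hz for N = 160 samples (toy values).
fs, f0, N = 1600, 200, 160
n = np.arange(N)
f = np.sin(2 * np.pi * f0 * n / fs)
F = np.fft.fft(f)                       # F_k = sum_n f_n * exp(-2*pi*i*k*n / N)
k_peak = int(np.argmax(np.abs(F[: N // 2])))   # search the non-redundant half
freq_peak = k_peak * fs / N             # convert the bin index to Hz
```

Here the bin spacing is <math>f_s/N = 10</math> Hz, so the 200 Hz tone falls exactly on bin 20; a spectrogram simply repeats this computation over sliding windows.<br />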
<br />
== Input Vector Format ==<br />
<br />
For each audio waveform in the TIMIT dataset, the Fourier coefficients were computed with a sliding Discrete Fourier Transform (DFT). The window duration used was 10 ms, corresponding to <math>n_s =<br />
80</math> samples per DFT since each waveform in the corpus was digitally registered with a sampling frequency of <math>f_s = 16</math> kHz, producing 40 unique coefficients at each timestep <math>t</math>, <math>\{c^{[t]}_k\}_{k=1}^{40}</math>. In addition, the first and second time derivatives of the coefficients between adjacent DFT windows were computed (the methodology is unspecified, however most likely this was performed with a numerical central difference technique). Thus, the input vector to the network at step <math>t</math> was the concatenated vector<br />
<br />
<math>{{\mathbf{x}}}_t = [c^{[t]}_1, \frac{d}{dt}c^{[t]}_1,<br />
\frac{d^2}{dt^2}c^{[t]}_1, c_2^{[t]}, \frac{d}{dt}c^{[t]}_2,<br />
\frac{d^2}{dt^2}c^{[t]}_2 \ldots]^T.</math><br />
<br />
Finally, an additional preprocessing step was performed: each input vector was normalized such that the dataset had zero mean and unit variance.<br />
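The construction of the input vectors can be sketched as follows; as noted above, the differentiation method is unspecified in the paper, so the central-difference choice (via <code>np.gradient</code>) and the per-feature normalization here are assumptions:<br />

```python
import numpy as np

def make_inputs(coeffs):
    """Interleave each coefficient with its 1st/2nd time differences, then normalize.
    Central differences are an assumption; the paper does not specify the method."""
    d1 = np.gradient(coeffs, axis=0)            # d/dt of each coefficient track
    d2 = np.gradient(d1, axis=0)                # d^2/dt^2
    X = np.stack([coeffs, d1, d2], axis=2)      # (T, K, 3): groups [c_k, dc_k, d2c_k]
    X = X.reshape(len(coeffs), -1)              # (T, 3K), interleaved as in the text
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)   # zero mean, unit variance

rng = np.random.default_rng(7)
T, K = 12, 40                                    # 12 frames, 40 coefficients per frame
X = make_inputs(rng.normal(size=(T, K)))         # -> 12 input vectors of length 120
```

For the paper's 40 coefficients per frame, this yields 120-dimensional input vectors <math>{{\mathbf{x}}}_t</math>.<br />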
<br />
== RNN Transducer ==<br />
<br />
When building a speech recognition classifier, it is important to note that the input and output sequences (sound data and phonemes) are of different lengths. Additionally, RNNs require segmented input data. One approach to solving both of these problems is to align the output (label) to the input (sound data), but more often than not an aligned dataset is not available. In this paper, the Connectionist Temporal Classification (CTC) method is used to create a probability distribution between input and output sequences. This is augmented with an RNN that predicts the next phoneme given the previous phonemes. The two predictions are then combined by a feed-forward network. The authors call this approach an RNN Transducer. From the distributions of the RNN and CTC, a maximum likelihood decoding for a given input can be computed to find the corresponding output label.<br />
<br />
<math>h(x) = \arg \max_{l \in L^{\leq T}} P(l | x)</math><br />
<br />
Where:<br />
<br />
* <math>h(x)</math>: classifier<br />
* <math>x</math>: input sequence<br />
* <math>l</math>: label<br />
* <math>L</math>: alphabet<br />
* <math>T</math>: maximum sequence length<br />
* <math>P(l | x)</math>: probability distribution of <math>l</math> given <math>x</math><br />
<br />
The value for <math>h(x)</math> cannot be computed directly; it is approximated with methods such as Best Path and Prefix Search Decoding. The authors chose to use a graph search algorithm called Beam Search.<br />
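Of the approximations mentioned, best-path decoding is the simplest to sketch: take the most probable symbol at each frame, collapse repeats, and drop the null (blank) symbol. The 6-frame distribution below is a toy example, not TIMIT data:<br />

```python
import numpy as np

def best_path_decode(frame_probs, blank=0):
    """Best-path approximation to argmax_l P(l|x): argmax label per frame,
    collapse consecutive repeats, then remove blanks (the CTC null symbol)."""
    path = np.argmax(frame_probs, axis=1)
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(int(s))
        prev = s
    return out

# Toy 6-frame distribution over {blank=0, A=1, B=2}; best path "_ A A _ B B" -> [A, B]
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.7, 0.1],
              [0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8],
              [0.1, 0.2, 0.7]])
label = best_path_decode(P)
```

Best path is greedy and can miss the true most likely labelling, which is why prefix search or beam search is preferred in practice.<br />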
<br />
== Network Output Layer ==<br />
<br />
Two different network output layers were used; however, most experimental results were reported for a simple softmax probability distribution vector over the set of <math>K = 62</math> symbols, corresponding to the 61 phonemes in the corpus and an additional ''null'' symbol indicating that no phoneme distinct from the previous one was detected. This model is referred to as a Connectionist Temporal Classification (CTC) output function. The other (more complicated) output layer was not rigorously compared with a softmax output, and had nearly identical performance; this summary defers a description of this method, the so-called ''RNN transducer'', to the original paper.<br />
<br />
== Network Training Procedure ==<br />
<br />
The parameters in all ANNs were determined using Stochastic Gradient Descent with a fixed update step size (learning rate) of <math>10^{-4}</math> and a Nesterov momentum term of 0.9. The initial parameters were uniformly randomly drawn from <math>[-0.1,0.1]</math>. The optimization procedure was initially run with data instances from the standard 462-speaker training set of the TIMIT corpus. As a stopping criterion for the training, a secondary testing subset of 50 speakers was used, on which the phoneme error rate (PER) was computed in each iteration of the optimization algorithm. The initial training phase for each network was halted once the PER stopped decreasing on this testing subset; using the parameters at this point as the initial weights, the optimization procedure was then re-run with Gaussian noise with zero mean and <math>\sigma = 0.075</math> added element-wise to the parameters for each training sequence <math>({{\mathbf{x}}}_1,\ldots, {{\mathbf{x}}}_T)</math> as a form of regularization. The second optimization procedure was again halted once the PER stopped decreasing on the testing subset. Multiple trials in each of these numerical experiments were not performed, and as such, the variability in performance due to the initial values of the parameters in the optimization routine is unknown.<br />
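The update rule can be sketched as follows; the placement of the weight noise (perturbing the weights at which the gradient is evaluated) is one plausible reading of the procedure, and the quadratic loss is a toy stand-in for illustration only:<br />

```python
import numpy as np

def nesterov_sgd_step(w, v, grad_fn, lr=1e-4, mu=0.9, noise_sigma=0.075, rng=None):
    """One SGD step with Nesterov momentum. If rng is given, Gaussian noise is added
    to the weights at which the gradient is evaluated (an assumed placement of the
    paper's weight-noise regularizer)."""
    w_eval = w + mu * v                          # Nesterov look-ahead point
    if rng is not None:
        w_eval = w_eval + rng.normal(0.0, noise_sigma, size=w.shape)
    v = mu * v - lr * grad_fn(w_eval)
    return w + v, v

# Sanity check on a toy quadratic loss 0.5*||w||^2 (grad = w): weights shrink to 0.
w, v = np.ones(3), np.zeros(3)
for _ in range(2000):
    w, v = nesterov_sgd_step(w, v, lambda u: u, lr=1e-2)
final_norm = float(np.linalg.norm(w))

# One noisy step, as used during the second training phase:
rng = np.random.default_rng(8)
w_noisy, _ = nesterov_sgd_step(np.ones(3), np.zeros(3), lambda u: u, rng=rng)
```

In the actual experiments, `grad_fn` would be the backpropagated gradient of the CTC (or transducer) loss rather than this toy quadratic.<br />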
<br />
= TIMIT Corpus Experiments &amp; Results =<br />
<br />
== Numerical Experiments ==<br />
<br />
To investigate the performance of the Bidirectional LSTM architecture as a function of depth, numerical experiments were conducted with networks with <math>N \in \{1,2,3,5\}</math> layers and 250 hidden units per layer. These are denoted in the paper by the network names CTC-<math>N</math>L-250H (where <math>N</math> is the layer depth), and are summarized with the number of free model parameters in the table below.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|}<br />
<br />
Additional experiments included: a 1-layer model with 3.8M weights, a 3-layer bidirectional ANN with <math>\tanh</math> activation functions rather than LSTM, and a 3-layer unidirectional LSTM model with 3.8M weights (the same number of free parameters as the bidirectional 3-layer LSTM model). Finally, two experiments were performed with a bidirectional LSTM model with 3 hidden layers, each with 250 hidden units, and an RNN transducer output function. One of these experiments used uniformly randomly initialized parameters, and the other used the final (hidden) parameter weights from the CTC-3L-250H model as the initial parameter values in the optimization algorithm. The names of these experiments are summarized below, where TRANS and PRETRANS denote the RNN transducer experiments initialized randomly and using (pretrained) parameters from the CTC-3L-250H model, respectively. The suffixes UNI and TANH denote the unidirectional and <math>\tanh</math> networks, respectively.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|-<br />
|PreTrans-3l-250h<br />
|4.3M<br />
|}<br />
<br />
== Results ==<br />
<br />
The percentage phoneme error rates and number of epochs in the SGD optimization procedure for the LSTM experiments on the TIMIT dataset with varying network depth are shown below. The PER can be seen to decrease monotonically; however, there is negligible difference between 3 and 5 layers, and it is possible that the 0.2% difference is within statistical fluctuations induced by the SGD optimization routine and initial parameter values. Note that the allocation of the epochs into either the initial training without noise or the second optimization routine with Gaussian noise added (or both) is unspecified in the paper.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|82<br />
|23.9%<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|55<br />
|21.0%<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|124<br />
|18.6%<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|150<br />
|18.4%<br />
|}<br />
<br />
The second set of PER results are shown below. The unidirectional LSTM architecture CTC-3L-421H-UNI achieves an error rate that is greater than that of the CTC-3L-250H model by 1 percentage point. No further comparative experiments between unidirectional and bidirectional models were given, however, and the margin of statistical uncertainty is unknown; thus the 1% (absolute) difference may or may not be significant. The TRANS-3L-250H model achieves a nearly identical PER to the CTC softmax model (a 0.3% difference); however, note that it has 0.5M ''more'' parameters due to the additional classification network at the output, and hence this is not an entirely fair comparison since it has a greater dimensionality. The pretrained model PRETRANS-3L-250H also has 4.3M parameters and sees the best performance with a 17.7% error rate. Note that the difference in training of these two RNN transducer models is primarily in their initialization: the PRETRANS model was initialized using the trained weights of the CTC-3L-250H model (for the hidden layers). Thus, this difference in error rate of 0.6% is the direct result of different starting iterates in the optimization procedure, which must be kept in mind when comparing between models.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|87<br />
|23.0%<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|107<br />
|37.6%<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|115<br />
|19.6%<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|112<br />
|18.3%<br />
|-<br />
|'''PreTrans-3l-250h'''<br />
|'''4.3M'''<br />
|'''144'''<br />
|'''17.7%'''<br />
|}<br />
<br />
= Further works =<br />
The first two authors extended the method so that it can readily be integrated into word-level language models <ref> Graves, A.; Jaitly, N.; Mohamed, A.-R., “Hybrid speech recognition with Deep Bidirectional LSTM," [http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6707742]</ref>. They used a hybrid approach in which frame-level acoustic targets are produced by a forced alignment given by a GMM-HMM system.<br />
<br />
= References =<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=graves_et_al.,_Speech_recognition_with_deep_recurrent_neural_networks&diff=26149graves et al., Speech recognition with deep recurrent neural networks2015-11-12T17:19:44Z<p>Rqiao: /* Motivation */</p>
<hr />
<div>= Overview =<br />
<br />
This document is a summary of the paper ''Speech recognition with deep recurrent neural networks'' by A. Graves, A.-R. Mohamed, and G. Hinton, which appeared in the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The first and third authors are Artificial Neural Network (ANN) researchers, while Mohamed works in the field of automatic speech recognition.<br />
<br />
The paper presents the application of bidirectional multilayer Long Short-term Memory (LSTM) ANNs with 1–5 layers to phoneme recognition on the TIMIT acoustic phoneme corpus, which is the standard benchmark in the field of acoustic recognition, extending the previous work by Mohamed and Hinton on this topic using Deep Belief Networks. The TIMIT corpus contains audio recordings of 6300 sentences spoken by 630 (American) English speakers from 8 regions with distinct dialects, where each recording has accompanying manually labelled transcriptions of the phonemes in the audio clips alongside timestamp information. The empirical classification accuracies reported in the literature before the publication of this paper are shown in the timeline below (note that in this figure, the accuracy metric is 100% - PER, where PER is the phoneme classification error rate).<br />
<br />
The presented deep LSTM networks with 3 or more layers obtain phoneme classification error rates of 19.6% or less, with one model obtaining 17.7%, the best result reported in the literature at the time, outperforming the previous record of 20.7% achieved by Mohamed et al. Furthermore, the error rate decreases monotonically with LSTM network depth for 1–5 layers. While the bidirectional LSTM model performs well on the TIMIT corpus, any potential advantage of bidirectional over unidirectional LSTM network models cannot be determined from this paper, since the performance comparison is across different numbers of iterations of the optimization algorithm used to train the models, and multiple trials for statistical validity were not performed.<br />
<br />
<br />
[[File:timit.png | frame | center |Timeline of percentage phoneme recognition accuracy achieved on the core TIMIT corpus, from Lopes and Perdigao, 2011. ]]<br />
<br />
== Motivation ==<br />
Neural networks have long been applied to speech recognition, but usually in combination with hidden Markov models. The authors argue that, since speech is an inherently dynamic process, RNNs should be the natural choice for the problem. There have been attempts to train RNNs for speech recognition <ref>A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, [http://machinelearning.wustl.edu/mlpapers/paper_files/icml2006_GravesFGS06.pdf “Connectionist Temporal Classification: Labelling Un-segmented Sequence Data with Recurrent Neural Networks” ] in ICML, Pittsburgh, USA, 2006.</ref> <ref> A. Graves, [http://download.springer.com/static/pdf/292/bok%253A978-3-642-24797-2.pdf?originUrl=http%3A%2F%2Flink.springer.com%2Fbook%2F10.1007%2F978-3-642-24797-2&token2=exp=1447349790~acl=%2Fstatic%2Fpdf%2F292%2Fbok%25253A978-3-642-24797-2.pdf%3ForiginUrl%3Dhttp%253A%252F%252Flink.springer.com%252Fbook%252F10.1007%252F978-3-642-24797-2*~hmac=06056ebc10e0f5e35b951ae6ab82de8b5c5bd9d8042084c07713a8946f561a7b Supervised sequence labelling with recurrent neural networks], vol. 385, Springer, 2012.</ref> <br />
<ref> A. Graves, [http://arxiv.org/pdf/1211.3711v1.pdf “Sequence transduction with recurrent neural networks”] in ICML Representation Learning Workshop, 2012.</ref> and to use LSTM RNNs for recognizing cursive handwriting <ref> A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, [http://papers.nips.cc/paper/3213-unconstrained-on-line-handwriting-recognition-with-recurrent-neural-networks.pdf “Unconstrained Online Handwriting Recognition with Recurrent Neural Networks”] in NIPS, 2008.</ref>, but none had so far made a major impact on speech recognition. The authors drew inspiration from Convolutional Neural Networks, in which multiple layers are stacked on top of each other, and combined this idea with LSTM to build deep recurrent networks.<br />
<br />
However, instead of using a conventional RNN, which only considers previous context, a Bidirectional RNN <ref> M. Schuster and K. K. Paliwal, [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=650093 “Bidirectional Recurrent Neural Networks”] IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.</ref> was used to consider both forward and backward contexts. This is in part because the authors saw no reason not to exploit future context, since the speech utterances are transcribed all at once. Additionally, a BRNN has the added benefit of being able to consider the entire forward and backward context, not just some predefined window of it.<br />
<br />
[[File:brnn.png|center|600px]]<br />
<br />
= Deep RNN models considered by Graves et al. =<br />
<br />
In this paper, Graves et al. use deep LSTM network models. We briefly review recurrent neural networks, which form the basis of the more complicated LSTM network that has composite <math>\mathcal{H}</math> functions instead of sigmoids and additional parameter vectors associated with the ''state'' of each neuron. Finally, a description of ''bidirectional'' ANNs is given, which is used throughout the numerical experiments.<br />
<br />
== Recurrent Neural Networks ==<br />
<br />
Recall that a standard 1-layer recurrent neural network (RNN) computes the hidden vector sequence <math>{\boldsymbol h} = ({{\mathbf{h}}}_1,\ldots,{{\mathbf{h}}}_T)</math> and output vector sequence <math>{{\boldsymbol {{\mathbf{y}}}}}= ({{\mathbf{y}}}_1,\ldots,{{\mathbf{y}}}_T)</math> from an input vector sequence <math>{{\boldsymbol {{\mathbf{x}}}}}= ({{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_T)</math> through the following equation where the index is from <math>t=1</math> to <math>T</math>:<br />
<br />
<math>{{\mathbf{h}}}_t = \begin{cases}<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{\mathbf{b_{h}}}}}\right) &\quad t = 1\\<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{{\mathbf{W}}}_{h h}}}{{\mathbf{h}}}_{t-1} + {{{\mathbf{b_{h}}}}}\right) &\quad \text{else}<br />
\end{cases}</math><br />
<br />
and<br />
<br />
<math>{{\mathbf{y}}}_t = {{{{\mathbf{W}}}_{h y}}}{{\mathbf{h}}}_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The <math>{{\mathbf{W}}}</math> terms are the parameter matrices with subscripts denoting the layer location (<span>e.g. </span><math>{{{{\mathbf{W}}}_{x h}}}</math> is the input-hidden weight matrix), and the offset <math>b</math> terms are bias vectors with appropriate subscripts (<span>e.g. </span><math>{{{\mathbf{b_{h}}}}}</math> is the hidden bias vector). The function <math>{\mathcal{H}}</math> is an elementwise vector function with a range of <math>[0,1]</math> for each component in the hidden layer.<br />
<br />
This paper considers multilayer RNN architectures, with the same hidden layer function used for all <math>N</math> layers. In this model, the hidden vector in the <math>n</math>th layer, <math>{\boldsymbol h}^n</math>, is generated by the rule<br />
<br />
<math>{{\mathbf{h}}}^n_t = {\mathcal{H}}\left({{\mathbf{W}}}_{h^{n-1}h^{n}} {{\mathbf{h}}}^{n-1}_t +<br />
{{\mathbf{W}}}_{h^{n}h^{n}} {{\mathbf{h}}}^n_{t-1} + {{{\mathbf{b_{h}}}}}^n \right),</math><br />
<br />
where <math>{\boldsymbol h}^0 = {{\boldsymbol {{\mathbf{x}}}}}</math>. The final network output vector in the <math>t</math>th step of the output sequence, <math>{{\mathbf{y}}}_t</math>, is<br />
<br />
<math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{h^N y}} {{\mathbf{h}}}^N_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
This is pictured in the figure below for an arbitrary layer and time step.<br />
[[File:rnn_graves.png | frame | center |Fig 1. Schematic of a Recurrent Neural Network at an arbitrary layer and time step. ]]<br />
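The recursion above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code; the function and argument names are our own, and the logistic sigmoid is used as one possible choice of <math>{\mathcal{H}}</math> with range <math>[0,1]</math>.<br />

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h, W_hy, b_y):
    """Forward pass of the 1-layer RNN recursion above.

    x_seq has shape (T, d_in); the weight/bias names mirror the equations
    but are otherwise our own convention.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # one choice of H with range [0, 1]
    h = np.zeros(W_hh.shape[0])   # h_0 = 0 makes the first step reduce to the t = 1 case
    ys = []
    for t in range(x_seq.shape[0]):
        h = sigmoid(W_xh @ x_seq[t] + W_hh @ h + b_h)  # hidden recursion
        ys.append(W_hy @ h + b_y)                      # y_t is an affine map of h_t
    return np.array(ys)
```

Starting from <math>{{\mathbf{h}}}_0 = \mathbf{0}</math> reproduces the <math>t=1</math> base case of the recursion above.<br />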
<br />
== Long Short-term Memory Architecture ==<br />
<br />
Graves et al. consider a Long Short-Term Memory (LSTM) architecture from Gers et al. This model replaces <math>\mathcal{H}(\cdot)</math> by a composite function that introduces additional parameter matrices, and hence a higher-dimensional model. Each neuron in the network (<span>i.e. </span> row of a parameter matrix <math>{{\mathbf{W}}}</math>) has an associated state vector <math>{{\mathbf{c}}}_t</math> at step <math>t</math>, which is a function of the previous <math>{{\mathbf{c}}}_{t-1}</math>, the input <math>{{\mathbf{x}}}_t</math> at step <math>t</math>, and the previous step’s hidden state <math>{{\mathbf{h}}}_{t-1}</math> as<br />
<br />
<math>{{\mathbf{c}}}_t = {{\mathbf{f}}}_t \circ {{\mathbf{c}}}_{t-1} + {{\mathbf{i}}}_t \circ \tanh<br />
\left({{{\mathbf{W}}}_{x {{\mathbf{c}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{c}}}}} {{\mathbf{h}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{c}}}\right)</math><br />
<br />
where <math>\circ</math> denotes the Hadamard product (elementwise vector multiplication), and the vector <math>{{\mathbf{i}}}_t</math> denotes the so-called ''input gate'' vector of the cell, which is generated by the rule<br />
<br />
<math>{{\mathbf{i}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{i}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{i}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{i}}}\right),</math><br />
<br />
and <math>{{\mathbf{f}}}_t</math> is the ''forget gate'' vector, which is given by<br />
<br />
<math>{{\mathbf{f}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{f}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{f}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{f}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{f}}}\right)</math><br />
<br />
Each <math>{{\mathbf{W}}}</math> matrix and bias vector <math>{{\mathbf{b}}}</math> is a free parameter in the model and must be trained. Since <math>{{\mathbf{f}}}_t</math> multiplies the previous state <math>{{\mathbf{c}}}_{t-1}</math> in a Hadamard product with each element in the range <math>[0,1]</math>, it can be understood to reduce or dampen the effect of <math>{{\mathbf{c}}}_{t-1}</math> relative to the new input <math>{{\mathbf{i}}}_t</math>. The final hidden output state is then<br />
<br />
<math>{{\mathbf{h}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{o}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{o}}}}} {{\mathbf{h}}}_{t-1}<br />
+ {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{o}}}}} {{\mathbf{c}}}_{t} + {{\mathbf{b}}}_{{\mathbf{o}}}\right)\circ \tanh({{\mathbf{c}}}_t)</math><br />
<br />
In all of these equations, <math>\sigma</math> denotes the logistic sigmoid function. Note furthermore that <math>{{\mathbf{i}}}</math>, <math>{{\mathbf{f}}}</math>, <math>{{\mathbf{o}}}</math> and <math>{{\mathbf{c}}}</math> are all of the same dimension as the hidden vector <math>h</math>. In addition, the weight matrices from the cell to gate vectors (<span>e.g. </span><math>{{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}</math>) are ''diagonal'', such that each parameter matrix is merely a scaling matrix.<br />
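A single step of the composite LSTM cell can be sketched as follows. This is a hedged illustration with our own parameter naming; in particular, the diagonal cell-to-gate matrices are stored as vectors <code>w_ci</code>, <code>w_cf</code>, <code>w_co</code> and applied elementwise.<br />

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step implementing the gate equations above.

    p is a dict of parameters (names are ours, not the paper's); the
    cell-to-gate 'peephole' weights are vectors, reflecting the diagonal
    weight matrices described in the text.
    """
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sig(p["W_xi"] @ x + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])  # input gate
    f = sig(p["W_xf"] @ x + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])  # forget gate
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x + p["W_hc"] @ h_prev + p["b_c"])  # new cell state
    o = sig(p["W_xo"] @ x + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])       # output gate
    h = o * np.tanh(c)                                                           # hidden output
    return h, c
```

Since the output gate lies in <math>(0,1)</math> and <math>\tanh</math> in <math>(-1,1)</math>, each component of the hidden output is bounded in magnitude by 1.<br />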
<br />
== Bidirectional RNNs ==<br />
<br />
A bidirectional RNN adds another layer of complexity by computing 2 hidden vectors per layer. Neglecting the <math>n</math> superscripts for the layer index, the ''forward'' hidden vector is determined through the conventional recursion as<br />
<br />
<math>{\overrightarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overrightarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}{\overrightarrow{{{\mathbf{h}}}}}}} {\overrightarrow{{{\mathbf{h}}}}}_{t-1} + {{\mathbf{b_{{\overrightarrow{{{\mathbf{h}}}}}}}}} \right),</math><br />
<br />
while the ''backward'' hidden state is determined recursively from the ''reversed'' sequence <math>({{\mathbf{x}}}_T,\ldots,{{\mathbf{x}}}_1)</math> as<br />
<br />
<math>{\overleftarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overleftarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}{\overleftarrow{{{\mathbf{h}}}}}}} {\overleftarrow{{{\mathbf{h}}}}}_{t+1} + {{\mathbf{b_{{\overleftarrow{{{\mathbf{h}}}}}}}}}\right).</math><br />
<br />
The final output for the single layer state is then an affine transformation of <math>{\overrightarrow{{{\mathbf{h}}}}}_t</math> and <math>{\overleftarrow{{{\mathbf{h}}}}}_t</math> as <math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}y}} {\overrightarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}y}} {\overleftarrow{{{\mathbf{h}}}}}_t<br />
+ {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The combination of LSTM with bidirectional dependence has been previously used by Graves and Schmidhuber, and is extended to multilayer networks in this paper. The motivation for this is to use dependencies on both prior and posterior vectors in the sequence to predict a given output at any time step. In other words, a forward and backward context is used.<br />
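The bidirectional combination can be sketched as below: one recursion scans the sequence forward, another scans it in reverse, and the per-step outputs are combined by the affine map above. The step functions and parameter names here are illustrative assumptions, not the paper's code.<br />

```python
import numpy as np

def birnn_layer(x_seq, fwd_step, bwd_step, W_fy, W_by, b_y, d_h):
    """One bidirectional layer.

    fwd_step and bwd_step each map (x_t, h) -> h; d_h is the hidden size.
    Returns y_t = W_fy h_fwd_t + W_by h_bwd_t + b_y for every t.
    """
    T = x_seq.shape[0]
    hf = np.zeros((T, d_h))
    hb = np.zeros((T, d_h))
    h = np.zeros(d_h)
    for t in range(T):                  # forward states, t = 1 .. T
        h = fwd_step(x_seq[t], h)
        hf[t] = h
    h = np.zeros(d_h)
    for t in reversed(range(T)):        # backward states, from the reversed sequence
        h = bwd_step(x_seq[t], h)
        hb[t] = h
    return hf @ W_fy.T + hb @ W_by.T + b_y   # each y_t sees both contexts
```

Note that every output <code>y[t]</code> depends on the whole sequence: on <math>{{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_t</math> through the forward state and on <math>{{\mathbf{x}}}_t,\ldots,{{\mathbf{x}}}_T</math> through the backward state.<br />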
<br />
= Network Training for Phoneme Recognition =<br />
<br />
This section describes the phoneme classification experiments performed with the TIMIT corpus. An overview of the input timeseries audio data preprocessing into frequency domain vectors is given, and the optimization techniques are described.<br />
<br />
== Frequency Domain Processing ==<br />
<br />
Recall that for a real, periodic signal <math>{f(t)}</math>, the Fourier transform<br />
<br />
<math>{F(\omega)}= \int_{-\infty}^{\infty}e^{-i\omega t} {f(t)}dt</math><br />
<br />
can be represented for discrete samples <math>{f_0, f_1, \cdots, f_{N-1}}</math> as<br />
<br />
<math>F_k = \sum_{n=0}^{N-1} f_n \, e^{-i \frac{2 \pi k n}{N}}</math>,<br />
<br />
where <math>{F_k}</math> are the discrete coefficients of the (amplitude) spectral distribution of the signal <math>f</math> in the frequency domain. This is a particularly powerful representation of audio data, since the modulation produced by the tongue and lips when shifting position while the larynx vibrates induces the frequency changes that make up a phonetic alphabet. A particular example is the spectrogram below. A spectrogram is a heat map representation of a matrix of data in which each pixel has intensity proportional to the magnitude of the matrix entry at that location. In this case, the matrix is a windowed Fourier transform of an audio signal spectrum such that the frequency of the audio signal as a function of time can be observed. This spectrogram shows the frequencies of my voice while singing the first bar of ''Hey Jude''; the bright pixels below 200 Hz show the base note, while the fainter lines at integer multiples of the base notes show resonant harmonics.<br />
<br />
[[File:spect.png | frame | center | Spectrogram of the first bar of Hey Jude, showing the frequency amplitude coefficients changing over time as the intensity of the pixels in the heat map.]]<br />
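A minimal spectrogram of this kind can be computed with a sliding DFT over non-overlapping windows, e.g. with NumPy's real FFT. This is a sketch for illustration only; practical spectrograms typically use overlapping, tapered windows.<br />

```python
import numpy as np

def spectrogram(signal, n_win):
    """Magnitude spectrogram: slide a length-n_win window over the signal
    and take the DFT of each frame (no overlap or tapering in this sketch)."""
    n_frames = len(signal) // n_win
    frames = signal[: n_frames * n_win].reshape(n_frames, n_win)
    return np.abs(np.fft.rfft(frames, axis=1))  # one row of |F_k| per frame
```

For a pure cosine at bin <math>k</math> of the window, each row of the result peaks at index <math>k</math>, which is exactly the bright-band structure visible in the figure.<br />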
<br />
== Input Vector Format ==<br />
<br />
For each audio waveform in the TIMIT dataset, the Fourier coefficients were computed with a sliding Discrete Fourier Transform (DFT). The window duration used was 10 ms, corresponding to <math>n_s = 160</math> samples per DFT window, since each waveform in the corpus was digitally registered with a sampling frequency of <math>f_s = 16</math> kHz; 40 coefficients were retained at each timestep <math>t</math>, <math>\{c^{[t]}_k\}_{k=1}^{40}</math>. In addition, the first and second time derivatives of the coefficients between adjacent DFT windows were computed (the methodology is unspecified; most likely a numerical central-difference technique was used). Thus, the input vector to the network at step <math>t</math> was the concatenated vector<br />
<br />
<math>{{\mathbf{x}}}_t = [c^{[t]}_1, \frac{d}{dt}c^{[t]}_1,<br />
\frac{d^2}{dt^2}c^{[t]}_1, c_2^{[t]}, \frac{d}{dt}c^{[t]}_2,<br />
\frac{d^2}{dt^2}c^{[t]}_2 \ldots]^T.</math><br />
<br />
Finally, an additional preprocessing step was performed: each input vector was normalized such that the dataset had zero mean and unit variance.<br />
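The input construction described above can be sketched as follows. The use of central differences for the derivatives is our assumption (the paper does not specify the method), and the ordering of the concatenated components is our own choice.<br />

```python
import numpy as np

def make_inputs(coeffs):
    """Build input vectors from a (T, 40) array of per-frame DFT coefficients.

    Appends first and second time derivatives (central differences, an
    assumption) and normalises each dimension to zero mean and unit variance.
    """
    d1 = np.gradient(coeffs, axis=0)               # first derivative over frames
    d2 = np.gradient(d1, axis=0)                   # second derivative
    x = np.concatenate([coeffs, d1, d2], axis=1)   # (T, 120) input vectors
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
```

In practice the normalisation statistics would be computed over the whole training set rather than a single utterance, as the text describes.<br />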
<br />
== RNN Transducer ==<br />
<br />
When building a speech recognition classifier, it is important to note that the input and output sequences (sound data and phonemes) are of different lengths. Additionally, conventional RNN training requires pre-segmented input data. One approach to solving both problems is to align the output (labels) to the input (sound data), but more often than not an aligned dataset is not available. In this paper, the Connectionist Temporal Classification (CTC) method is used to define a probability distribution over output sequences given the inputs. This is augmented with an RNN that predicts each phoneme given the previous phonemes. The two predictions are then combined by a feed-forward network; the authors call this approach an RNN transducer. From the combined distribution of the RNN and CTC, a maximum-likelihood decoding for a given input can be computed to find the corresponding output label sequence.<br />
<br />
<math>h(x) = \arg \max_{l \in L^{\leq T}} P(l | x)</math><br />
<br />
Where:<br />
<br />
* <math>h(x)</math>: classifier<br />
* <math>x</math>: input sequence<br />
* <math>l</math>: label<br />
* <math>L</math>: alphabet<br />
* <math>T</math>: maximum sequence length<br />
* <math>P(l | x)</math>: probability distribution of <math>l</math> given <math>x</math><br />
<br />
The value of <math>h(x)</math> cannot be computed directly; it is approximated with methods such as best-path or prefix-search decoding. The authors chose to use a graph search algorithm called beam search.<br />
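As a concrete (and much cruder) stand-in for the beam search the authors actually use, the best-path approximation simply takes the most probable symbol at each frame, collapses repeats, and removes blanks:<br />

```python
import numpy as np

def best_path_decode(probs, blank=0):
    """Best-path approximation to h(x) = argmax_l P(l | x).

    probs is a (T, K) array of per-frame symbol probabilities; the blank
    index is an assumption of this sketch. Repeated symbols are collapsed
    before blanks are removed, so a blank can separate genuine repeats.
    """
    path = probs.argmax(axis=1)  # most probable symbol per frame
    return [int(s) for t, s in enumerate(path)
            if s != blank and (t == 0 or s != path[t - 1])]
```

For example, the frame-wise path <code>[1, 1, 0, 1, 2, 2]</code> (with blank 0) decodes to <code>[1, 1, 2]</code>: the blank keeps the two occurrences of symbol 1 from being merged.<br />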
<br />
== Network Output Layer ==<br />
<br />
Two different network output layers were used; however, most experimental results were reported for a simple softmax probability distribution vector over the set of <math>K = 62</math> symbols, corresponding to the 61 phonemes in the corpus plus an additional ''null'' symbol indicating that no phoneme distinct from the previous one was detected. This model is referred to as a Connectionist Temporal Classification (CTC) output function. The other (more complicated) output layer, the ''RNN transducer'' described above, was not rigorously compared with the softmax output and had nearly identical performance.<br />
<br />
== Network Training Procedure ==<br />
<br />
The parameters in all ANNs were determined using Stochastic Gradient Descent (SGD) with a fixed update step size (learning rate) of <math>10^{-4}</math> and a Nesterov momentum term of 0.9. The initial parameters were drawn uniformly at random from <math>[-0.1,0.1]</math>. The optimization procedure was initially run with data instances from the standard 462-speaker training set of the TIMIT corpus. As a stopping criterion, a secondary development subset of 50 speakers was used, on which the phoneme error rate (PER) was computed in each iteration of the optimization algorithm. The initial training phase for each network was halted once the PER stopped decreasing on this development set; using the parameters at this point as the initial weights, the optimization procedure was then re-run with zero-mean Gaussian noise with <math>\sigma = 0.075</math> added element-wise to the parameters for each training sequence <math>({{\mathbf{x}}}_1,\ldots, {{\mathbf{x}}}_T)</math>, as a form of regularization. The second optimization procedure was again halted once the PER stopped decreasing on the development set. Multiple trials of these numerical experiments were not performed, and as such, the variability in performance due to the initial parameter values in the optimization routine is unknown.<br />
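The update rule just described can be sketched as follows. All names are ours, and the placement of the weight noise (added to the weights before the gradient evaluation, freshly drawn for each sequence) is our reading of the procedure rather than a detail stated in the paper.<br />

```python
import numpy as np

def nesterov_sgd_step(w, v, grad_fn, lr=1e-4, mu=0.9, noise_sigma=0.0, rng=None):
    """One SGD update with Nesterov momentum and optional Gaussian weight noise.

    grad_fn maps a weight vector to its gradient. With noise_sigma > 0 this
    mimics the second training phase (sigma = 0.075 in the paper).
    """
    rng = rng or np.random.default_rng()
    # Weight noise regularization: perturb the weights used for this sequence.
    w_eval = w + (noise_sigma * rng.normal(size=w.shape) if noise_sigma else 0.0)
    g = grad_fn(w_eval + mu * v)   # Nesterov: gradient at the look-ahead point
    v = mu * v - lr * g            # momentum update with fixed learning rate
    return w + v, v
```

On a simple quadratic loss this update converges geometrically, which is a quick sanity check that the momentum recursion is wired up correctly.<br />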
<br />
= TIMIT Corpus Experiments &amp; Results =<br />
<br />
== Numerical Experiments ==<br />
<br />
To investigate the performance of the Bidirectional LSTM architecture as a function of depth, numerical experiments were conducted with networks with <math>N \in \{1,2,3,5\}</math> layers and 250 hidden units per layer. These are denoted in the paper by the network names CTC-<math>N</math>L-250H (where <math>N</math> is the layer depth), and are summarized with the number of free model parameters in the table below.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|}<br />
<br />
Additional experiments included: a 1-layer model with 3.8M weights, a 3-layer bidirectional ANN with <math>\tanh</math> activation functions rather than LSTM, and a 3-layer unidirectional LSTM model with 3.8M weights (the same number of free parameters as the bidirectional 3-layer LSTM model). Finally, two experiments were performed with a bidirectional LSTM model with 3 hidden layers of 250 hidden units each and an RNN transducer output function. One of these experiments used uniformly randomly initialized parameters, and the other used the final (hidden) parameter weights from the CTC-3L-250H model as the initial parameter values in the optimization algorithm. The names of these experiments are summarized below, where TRANS and PRETRANS denote the RNN transducer experiments initialized randomly and using (pretrained) parameters from the CTC-3L-250H model, respectively. The suffixes UNI and TANH denote the unidirectional and <math>\tanh</math> networks, respectively.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|-<br />
|PreTrans-3l-250h<br />
|4.3M<br />
|}<br />
<br />
== Results ==<br />
<br />
The percentage phoneme error rates (PER) and the number of SGD epochs for the LSTM experiments on the TIMIT dataset with varying network depth are shown below. The PER decreases monotonically with depth; however, the difference between 3 and 5 layers is only 0.2%, which may well lie within the statistical fluctuations induced by the SGD optimization routine and the initial parameter values. Note that the paper does not specify how the reported epochs are split between the initial noise-free training phase and the second phase with Gaussian weight noise added.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|82<br />
|23.9%<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|55<br />
|21.0%<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|124<br />
|18.6%<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|150<br />
|18.4%<br />
|}<br />
<br />
The second set of PER results is shown below. The unidirectional LSTM architecture CTC-3L-421H-UNI achieves an error rate 1 percentage point greater than that of the bidirectional CTC-3L-250H model. No further comparative experiments between unidirectional and bidirectional models were given, however, and the margin of statistical uncertainty is unknown; thus the 1% (absolute) difference may or may not be significant. The TRANS-3L-250H model achieves a nearly identical PER to the CTC softmax model (a 0.3% difference); note, however, that it has 0.5M ''more'' parameters due to the additional classification network at the output, so the comparison is not entirely fair since it has a greater dimensionality. The pretrained model PRETRANS-3L-250H also has 4.3M parameters and attains the best performance, with a 17.7% error rate. Note that the training of these two RNN transducer models differs primarily in their initialization: the PRETRANS model was initialized using the trained hidden-layer weights of the CTC-3L-250H model. Thus, this 0.6% difference in error rate is the direct result of different starting iterates in the optimization procedure, which must be kept in mind when comparing the models.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|87<br />
|23.0%<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|107<br />
|37.6%<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|115<br />
|19.6%<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|112<br />
|18.3%<br />
|-<br />
|'''PreTrans-3l-250h'''<br />
|'''4.3M'''<br />
|'''144'''<br />
|'''17.7%'''<br />
|}<br />
<br />
= Further works =<br />
The first two authors extended the method so that it can readily be integrated into word-level language models <ref> Graves, A.; Jaitly, N.; Mohamed, A.-R., “Hybrid speech recognition with Deep Bidirectional LSTM," [http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6707742]</ref>. They used a hybrid approach in which frame-level acoustic targets are produced by a forced alignment given by a GMM-HMM system.<br />
<br />
= References =<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=graves_et_al.,_Speech_recognition_with_deep_recurrent_neural_networks&diff=26148graves et al., Speech recognition with deep recurrent neural networks2015-11-12T17:16:06Z<p>Rqiao: /* Motivation */</p>
<hr />
<div>= Overview =<br />
<br />
This document is a summary of the paper ''Speech recognition with deep recurrent neural networks'' by A. Graves, A.-R. Mohamed, and G. Hinton, which appeared in the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The first and third authors are Artificial Neural Network (ANN) researchers, while Mohamed works in the field of automatic speech recognition.<br />
<br />
The paper presents the application of bidirectional multilayer Long Short-term Memory (LSTM) ANNs with 1–5 layers to phoneme recognition on the TIMIT acoustic phoneme corpus, which is the standard benchmark in the field of acoustic recognition, extending the previous work by Mohamed and Hinton on this topic using Deep Belief Networks. The TIMIT corpus contains audio recordings of 6300 sentences spoken by 630 (American) English speakers from 8 regions with distinct dialects, where each recording has accompanying manually labelled transcriptions of the phonemes in the audio clips alongside timestamp information. The empirical classification accuracies reported in the literature before the publication of this paper are shown in the timeline below (note that in this figure, the accuracy metric is 100% - PER, where PER is the phoneme classification error rate).<br />
<br />
The presented deep LSTM networks with 3 or more layers obtain phoneme classification error rates of 19.6% or less, with one model obtaining 17.7%, the best result reported in the literature at the time, outperforming the previous record of 20.7% achieved by Mohamed et al. Furthermore, the error rate decreases monotonically with LSTM network depth for 1–5 layers. While the bidirectional LSTM model performs well on the TIMIT corpus, any potential advantage of bidirectional over unidirectional LSTM network models cannot be determined from this paper, since the performance comparison is across different numbers of iterations of the optimization algorithm used to train the models, and multiple trials for statistical validity were not performed.<br />
<br />
<br />
[[File:timit.png | frame | center |Timeline of percentage phoneme recognition accuracy achieved on the core TIMIT corpus, from Lopes and Perdigao, 2011. ]]<br />
<br />
== Motivation ==<br />
Neural networks have long been applied to speech recognition, but usually in combination with hidden Markov models. The authors argue that, since speech is an inherently dynamic process, RNNs should be the natural choice for the problem. There have been attempts to train RNNs for speech recognition <ref>A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, [http://machinelearning.wustl.edu/mlpapers/paper_files/icml2006_GravesFGS06.pdf “Connectionist Temporal Classification: Labelling Un-segmented Sequence Data with Recurrent Neural Networks” ] in ICML, Pittsburgh, USA, 2006.</ref> <ref> A. Graves, Supervised sequence labelling with recurrent neural networks, vol. 385, Springer, 2012.</ref> <ref> A. Graves, “Sequence transduction with recurrent neural networks,” in ICML Representation Learning Workshop, 2012.</ref> and to use LSTM RNNs for recognizing cursive handwriting <ref> A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, “Unconstrained Online Handwriting Recognition with Recurrent Neural Networks,” in NIPS, 2008.</ref>, but none had so far made a major impact on speech recognition. The authors drew inspiration from Convolutional Neural Networks, in which multiple layers are stacked on top of each other, and combined this idea with LSTM to build deep recurrent networks.<br />
<br />
However, instead of using a conventional RNN, which only considers previous context, a Bidirectional RNN <ref> M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,” IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.</ref> was used to consider both forward and backward contexts. This is in part because the authors saw no reason not to exploit future context, since the speech utterances are transcribed all at once. Additionally, a BRNN has the added benefit of being able to consider the entire forward and backward context, not just some predefined window of it.<br />
<br />
[[File:brnn.png|center|600px]]<br />
<br />
= Deep RNN models considered by Graves et al. =<br />
<br />
In this paper, Graves et al. use deep LSTM network models. We briefly review recurrent neural networks, which form the basis of the more complicated LSTM network that has composite <math>\mathcal{H}</math> functions instead of sigmoids and additional parameter vectors associated with the ''state'' of each neuron. Finally, a description of ''bidirectional'' ANNs is given, which is used throughout the numerical experiments.<br />
<br />
== Recurrent Neural Networks ==<br />
<br />
Recall that a standard 1-layer recurrent neural network (RNN) computes the hidden vector sequence <math>{\boldsymbol h} = ({{\mathbf{h}}}_1,\ldots,{{\mathbf{h}}}_T)</math> and output vector sequence <math>{{\boldsymbol {{\mathbf{y}}}}}= ({{\mathbf{y}}}_1,\ldots,{{\mathbf{y}}}_T)</math> from an input vector sequence <math>{{\boldsymbol {{\mathbf{x}}}}}= ({{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_T)</math> through the following equation where the index is from <math>t=1</math> to <math>T</math>:<br />
<br />
<math>{{\mathbf{h}}}_t = \begin{cases}<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{\mathbf{b_{h}}}}}\right) &\quad t = 1\\<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{{\mathbf{W}}}_{h h}}}{{\mathbf{h}}}_{t-1} + {{{\mathbf{b_{h}}}}}\right) &\quad \text{else}<br />
\end{cases}</math><br />
<br />
and<br />
<br />
<math>{{\mathbf{y}}}_t = {{{{\mathbf{W}}}_{h y}}}{{\mathbf{h}}}_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The <math>{{\mathbf{W}}}</math> terms are the parameter matrices with subscripts denoting the layer location (<span>e.g. </span><math>{{{{\mathbf{W}}}_{x h}}}</math> is the input-hidden weight matrix), and the offset <math>b</math> terms are bias vectors with appropriate subscripts (<span>e.g. </span><math>{{{\mathbf{b_{h}}}}}</math> is the hidden bias vector). The function <math>{\mathcal{H}}</math> is an elementwise vector function with a range of <math>[0,1]</math> for each component in the hidden layer.<br />
<br />
This paper considers multilayer RNN architectures, with the same hidden layer function used for all <math>N</math> layers. In this model, the hidden vector in the <math>n</math>th layer, <math>{\boldsymbol h}^n</math>, is generated by the rule<br />
<br />
<math>{{\mathbf{h}}}^n_t = {\mathcal{H}}\left({{\mathbf{W}}}_{h^{n-1}h^{n}} {{\mathbf{h}}}^{n-1}_t +<br />
{{\mathbf{W}}}_{h^{n}h^{n}} {{\mathbf{h}}}^n_{t-1} + {{{\mathbf{b_{h}}}}}^n \right),</math><br />
<br />
where <math>{\boldsymbol h}^0 = {{\boldsymbol {{\mathbf{x}}}}}</math>. The final network output vector in the <math>t</math>th step of the output sequence, <math>{{\mathbf{y}}}_t</math>, is<br />
<br />
<math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{h^N y}} {{\mathbf{h}}}^N_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
This is pictured in the figure below for an arbitrary layer and time step.<br />
[[File:rnn_graves.png | frame | center |Fig 1. Schematic of a Recurrent Neural Network at an arbitrary layer and time step. ]]<br />
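The recursion above is straightforward to implement. The following sketch (with the logistic sigmoid standing in for the generic <math>{\mathcal{H}}</math>, and all dimensions chosen arbitrarily for illustration) runs a 1-layer RNN over a short sequence:<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(x_seq, W_xh, W_hh, b_h, W_hy, b_y):
    """Run the 1-layer RNN recursion over a whole sequence.
    Starting from h_0 = 0 makes the t = 1 branch of the recursion
    coincide with the general case."""
    hs, ys = [], []
    h = np.zeros(W_hh.shape[0])
    for x_t in x_seq:
        h = sigmoid(W_xh @ x_t + W_hh @ h + b_h)   # hidden recursion H(...)
        hs.append(h)
        ys.append(W_hy @ h + b_y)                  # affine output layer
    return np.array(hs), np.array(ys)

# Tiny example: 3 input dims, 4 hidden units, 2 outputs, T = 5 steps.
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))
W_xh, W_hh = 0.1 * rng.normal(size=(4, 3)), 0.1 * rng.normal(size=(4, 4))
W_hy = 0.1 * rng.normal(size=(2, 4))
hs, ys = rnn_forward(x_seq, W_xh, W_hh, np.zeros(4), W_hy, np.zeros(2))
```

Stacking layers as in the multilayer rule amounts to feeding the sequence <math>{\boldsymbol h}^{n-1}</math> produced by one such pass as the input of the next.<br />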
<br />
== Long Short-term Memory Architecture ==<br />
<br />
Graves et al. consider a Long Short-Term Memory (LSTM) architecture following Gers et al. This model replaces <math>\mathcal{H}(\cdot)</math> by a composite function that incurs additional parameter matrices, and hence a higher-dimensional model. Each neuron in the network (<span>i.e. </span>row of a parameter matrix <math>{{\mathbf{W}}}</math>) has an associated state vector <math>{{\mathbf{c}}}_t</math> at step <math>t</math>, which is a function of the previous <math>{{\mathbf{c}}}_{t-1}</math>, the input <math>{{\mathbf{x}}}_t</math> at step <math>t</math>, and the previous step’s hidden state <math>{{\mathbf{h}}}_{t-1}</math> as<br />
<br />
<math>{{\mathbf{c}}}_t = {{\mathbf{f}}}_t \circ {{\mathbf{c}}}_{t-1} + {{\mathbf{i}}}_t \circ \tanh<br />
\left({{{\mathbf{W}}}_{x {{\mathbf{c}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{c}}}}} {{\mathbf{h}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{c}}}\right)</math><br />
<br />
where <math>\circ</math> denotes the Hadamard product (elementwise vector multiplication), and the vector <math>{{\mathbf{i}}}_t</math> denotes the so-called ''input gate'' vector of the cell, which is generated by the rule<br />
<br />
<math>{{\mathbf{i}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{i}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{i}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{i}}}\right),</math><br />
<br />
and <math>{{\mathbf{f}}}_t</math> is the ''forget gate'' vector, which is given by<br />
<br />
<math>{{\mathbf{f}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{f}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{f}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{f}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{f}}}\right)</math><br />
<br />
Each <math>{{\mathbf{W}}}</math> matrix and bias vector <math>{{\mathbf{b}}}</math> is a free parameter in the model and must be trained. Since <math>{{\mathbf{f}}}_t</math> multiplies the previous state <math>{{\mathbf{c}}}_{t-1}</math> in a Hadamard product with each element in the range <math>[0,1]</math>, it can be understood to reduce or dampen the effect of <math>{{\mathbf{c}}}_{t-1}</math> relative to the new input <math>{{\mathbf{i}}}_t</math>. The final hidden output state is then<br />
<br />
<math>{{\mathbf{h}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{o}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{o}}}}} {{\mathbf{h}}}_{t-1}<br />
+ {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{o}}}}} {{\mathbf{c}}}_{t} + {{\mathbf{b}}}_{{\mathbf{o}}}\right)\circ \tanh({{\mathbf{c}}}_t)</math><br />
<br />
In all of these equations, <math>\sigma</math> denotes the logistic sigmoid function. Note furthermore that <math>{{\mathbf{i}}}</math>, <math>{{\mathbf{f}}}</math>, <math>{{\mathbf{o}}}</math> and <math>{{\mathbf{c}}}</math> are all of the same dimension as the hidden vector <math>h</math>. In addition, the weight matrices from the cell to the gate vectors (<span>e.g. </span><math>{{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}</math>) are ''diagonal'', such that each of these parameter matrices is merely a scaling matrix.<br />
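A single step of the composite LSTM function above can be sketched as follows; the parameter names and shapes are illustrative, and the diagonal cell-to-gate matrices are stored as vectors so that their action reduces to elementwise scaling:<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of the composite LSTM function. p is a dict of
    parameters; the diagonal cell-to-gate matrices W_ci, W_cf, W_co
    are stored as vectors w_ci, w_cf, w_co and applied elementwise."""
    i = sigmoid(p["W_xi"] @ x + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])  # input gate
    f = sigmoid(p["W_xf"] @ x + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])  # forget gate
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x + p["W_hc"] @ h_prev + p["b_c"])      # cell state
    o = sigmoid(p["W_xo"] @ x + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])       # output gate
    h = o * np.tanh(c)                                                               # hidden output
    return h, c

# Illustrative dimensions: 3 inputs, 4 hidden units.
rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
p = {k: 0.1 * rng.normal(size=(n_hid, n_in)) for k in ("W_xi", "W_xf", "W_xc", "W_xo")}
p.update({k: 0.1 * rng.normal(size=(n_hid, n_hid)) for k in ("W_hi", "W_hf", "W_hc", "W_ho")})
p.update({k: 0.1 * rng.normal(size=n_hid) for k in ("w_ci", "w_cf", "w_co")})
p.update({k: np.zeros(n_hid) for k in ("b_i", "b_f", "b_c", "b_o")})
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), p)
```

Note how the forget gate <math>{{\mathbf{f}}}_t</math> scales the old cell state while the input gate scales the new candidate, exactly as in the update rule for <math>{{\mathbf{c}}}_t</math>.<br />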
<br />
== Bidirectional RNNs ==<br />
<br />
A bidirectional RNN adds another layer of complexity by computing 2 hidden vectors per layer. Neglecting the <math>n</math> superscripts for the layer index, the ''forward'' hidden vector is determined through the conventional recursion as<br />
<br />
<math>{\overrightarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overrightarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}{\overrightarrow{{{\mathbf{h}}}}}}} {\overrightarrow{{{\mathbf{h}}}}}_{t-1} + {{\mathbf{b_{{\overrightarrow{{{\mathbf{h}}}}}}}}} \right),</math><br />
<br />
while the ''backward'' hidden state is determined recursively from the ''reversed'' sequence <math>({{\mathbf{x}}}_T,\ldots,{{\mathbf{x}}}_1)</math> as<br />
<br />
<math>{\overleftarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overleftarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}{\overleftarrow{{{\mathbf{h}}}}}}} {\overleftarrow{{{\mathbf{h}}}}}_{t+1} + {{\mathbf{b_{{\overleftarrow{{{\mathbf{h}}}}}}}}}\right).</math><br />
<br />
The final output for the single layer state is then an affine transformation of <math>{\overrightarrow{{{\mathbf{h}}}}}_t</math> and <math>{\overleftarrow{{{\mathbf{h}}}}}_t</math> as <math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}y}} {\overrightarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}y}} {\overleftarrow{{{\mathbf{h}}}}}_t<br />
+ {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The combination of LSTM with bidirectional dependence has been previously used by Graves and Schmidhuber, and is extended to multilayer networks in this paper. The motivation for this is to use dependencies on both prior and posterior vectors in the sequence to predict a given output at any time step. In other words, a forward and backward context is used.<br />
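A minimal sketch of one bidirectional layer (with <math>\tanh</math> standing in for the generic <math>{\mathcal{H}}</math>, and illustrative dimensions) makes the two recursions and the final affine combination explicit:<br />

```python
import numpy as np

def birnn_layer(x_seq, Wf_xh, Wf_hh, bf, Wb_xh, Wb_hh, bb, W_fy, W_by, b_y):
    """One bidirectional layer: a forward recursion over x_1..x_T, a
    backward recursion over the reversed sequence, and a per-step
    affine combination of the two hidden states."""
    T, n_hid = len(x_seq), Wf_hh.shape[0]
    h_fwd = np.zeros((T, n_hid))
    h_bwd = np.zeros((T, n_hid))
    h = np.zeros(n_hid)
    for t in range(T):                       # forward pass: t = 1, ..., T
        h = np.tanh(Wf_xh @ x_seq[t] + Wf_hh @ h + bf)
        h_fwd[t] = h
    h = np.zeros(n_hid)
    for t in reversed(range(T)):             # backward pass over the reversed sequence
        h = np.tanh(Wb_xh @ x_seq[t] + Wb_hh @ h + bb)
        h_bwd[t] = h
    y = h_fwd @ W_fy.T + h_bwd @ W_by.T + b_y
    return h_fwd, h_bwd, y

rng = np.random.default_rng(2)
mk = lambda *shape: 0.1 * rng.normal(size=shape)
x_seq = rng.normal(size=(6, 3))              # T = 6 steps, 3 input dims
h_fwd, h_bwd, y = birnn_layer(x_seq, mk(4, 3), mk(4, 4), np.zeros(4),
                              mk(4, 3), mk(4, 4), np.zeros(4),
                              mk(2, 4), mk(2, 4), np.zeros(2))
```

Because the backward recursion starts from <math>{{\mathbf{x}}}_T</math>, every output <math>{{\mathbf{y}}}_t</math> depends on the whole utterance, which is the motivation stated above.<br />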
<br />
= Network Training for Phoneme Recognition =<br />
<br />
This section describes the phoneme classification experiments performed with the TIMIT corpus. An overview of the input timeseries audio data preprocessing into frequency domain vectors is given, and the optimization techniques are described.<br />
<br />
== Frequency Domain Processing ==<br />
<br />
Recall that for a real signal <math>{f(t)}</math>, the Fourier transform<br />
<br />
<math>{F(\omega)}= \int_{-\infty}^{\infty}e^{-i\omega t} {f(t)}dt</math><br />
<br />
can be represented for discrete samples <math>{f_0, f_1, \cdots,<br />
f_{N-1}}</math> as<br />
<br />
<math>F_k = \sum_{n=0}^{N-1} f_n \, e^{-i \frac{2 \pi k n}{N}}</math>,<br />
<br />
where <math>{F_k}</math> are the discrete coefficients of the (amplitude) spectral distribution of the signal <math>f</math> in the frequency domain. This is a particularly powerful representation of audio data, since the modulation produced by the tongue and lips when shifting position while the larynx vibrates induces the frequency changes that make up a phonetic alphabet. A particular example is the spectrogram below. A spectrogram is a heat map representation of a matrix of data in which each pixel has intensity proportional to the magnitude of the matrix entry at that location. In this case, the matrix is a windowed Fourier transform of an audio signal spectrum such that the frequency of the audio signal as a function of time can be observed. This spectrogram shows the frequencies of my voice while singing the first bar of ''Hey Jude''; the bright pixels below 200 Hz show the base note, while the fainter lines at integer multiples of the base notes show resonant harmonics.<br />
<br />
[[File:spect.png | frame | center | Spectrogram of the first bar of Hey Jude, showing the frequency amplitude coefficients changing over time as the intensity of the pixels in the heat map.]]<br />
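A spectrogram like the one above can be computed with a windowed DFT. The sketch below uses a synthetic two-tone signal and illustrative window settings (25 ms Hanning windows with a 10 ms hop); these are not the paper's preprocessing parameters:<br />

```python
import numpy as np

# Sketch: a windowed DFT ("spectrogram") of a synthetic two-tone signal.
fs = 16000                                    # sampling rate in Hz
t = np.arange(fs) / fs                        # one second of samples
signal = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)

win, hop = 400, 160                           # 25 ms windows, 10 ms hop (illustrative)
frames = [signal[s:s + win] * np.hanning(win) # slide a tapered window along the signal
          for s in range(0, len(signal) - win, hop)]
spec = np.abs(np.fft.rfft(frames, axis=1))    # magnitude spectrum of each frame

# Each row of `spec` is one column of the spectrogram; the strongest
# bin in every frame sits near the louder 220 Hz tone.
peak_hz = np.argmax(spec, axis=1) * fs / win
```

Plotting <code>spec.T</code> as a heat map (time on the horizontal axis, frequency bin on the vertical) reproduces the kind of figure shown above.<br />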
<br />
== Input Vector Format ==<br />
<br />
For each audio waveform in the TIMIT dataset, spectral coefficients were computed with a sliding Discrete Fourier Transform (DFT). The window duration used was 10 ms which, at the corpus sampling frequency of <math>f_s = 16</math> kHz, corresponds to <math>n_s = 160</math> samples per window; from each window, 40 coefficients were retained at each timestep <math>t</math>, <math>\{c^{[t]}_k\}_{k=1}^{40}</math> (the original paper describes these as mel-scale filter-bank coefficients rather than raw DFT bins). In addition, the first and second time derivatives of the coefficients between adjacent DFT windows were computed (the methodology is unspecified; most likely this was performed with a numerical central-difference technique). Thus, the input vector to the network at step <math>t</math> was the concatenated vector<br />
<br />
<math>{{\mathbf{x}}}_t = [c^{[t]}_1, \frac{d}{dt}c^{[t]}_1,<br />
\frac{d^2}{dt^2}c^{[t]}_1, c_2^{[t]}, \frac{d}{dt}c^{[t]}_2,<br />
\frac{d^2}{dt^2}c^{[t]}_2 \ldots]^T.</math><br />
<br />
Finally, an additional preprocessing step was performed: each input vector was normalized such that the dataset had zero mean and unit variance.<br />
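The construction of the input vectors can be sketched as follows; the central-difference derivatives (via <code>np.gradient</code>) are our assumption, since the paper does not specify the method:<br />

```python
import numpy as np

# Sketch: build per-frame input vectors by interleaving each coefficient
# with its first and second time derivatives, then normalize every input
# dimension to zero mean and unit variance over the dataset.
rng = np.random.default_rng(3)
coeffs = rng.normal(loc=2.0, scale=3.0, size=(100, 40))  # T frames x 40 coefficients

d1 = np.gradient(coeffs, axis=0)   # first time derivative (central differences)
d2 = np.gradient(d1, axis=0)       # second time derivative

# Interleave per frame: [c_1, dc_1, ddc_1, c_2, dc_2, ddc_2, ...]
x = np.stack([coeffs, d1, d2], axis=2).reshape(100, 120)

# Normalize each of the 120 input dimensions over the dataset.
x = (x - x.mean(axis=0)) / x.std(axis=0)
```

The 100 frames and random "coefficients" here are placeholders for an actual utterance's DFT output; only the interleaving and normalization steps are the point.<br />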
<br />
== RNN Transducer ==<br />
<br />
When building a speech recognition classifier, it is important to note that the input and output sequences have different lengths (frames of sound data versus phonemes). Additionally, framewise training of RNNs requires pre-segmented input data. One approach to solving both problems is to align the output labels to the input sound data, but more often than not an aligned dataset is not available. In this paper, the Connectionist Temporal Classification (CTC) method is used to define a probability distribution over output sequences given an input sequence. This is augmented with an RNN that predicts each phoneme given the previous phonemes. The two predictions are then combined by a feed-forward network; the authors call this approach an RNN transducer. From the combined distribution of the RNN and CTC, a maximum-likelihood decoding for a given input can be computed to find the corresponding output labelling:<br />
<br />
<math>h(x) = \arg \max_{l \in L^{\leq T}} P(l | x)</math><br />
<br />
Where:<br />
<br />
* <math>h(x)</math>: classifier<br />
* <math>x</math>: input sequence<br />
* <math>l</math>: label<br />
* <math>L</math>: alphabet<br />
* <math>T</math>: maximum sequence length<br />
* <math>P(l | x)</math>: probability distribution of <math>l</math> given <math>x</math><br />
<br />
The value of <math>h(x)</math> cannot be computed directly; it is approximated with methods such as best-path or prefix-search decoding. The authors chose to use a graph search algorithm called beam search.<br />
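For intuition, best-path decoding, the simplest of these approximations, takes the most probable symbol at each frame, collapses consecutive repeats, and removes blanks. A sketch with a toy 3-symbol alphabet (the paper itself uses beam search, which searches over many candidate paths instead of just the single most likely one):<br />

```python
import numpy as np

def best_path_decode(probs, blank=0):
    """Best-path CTC decoding: argmax symbol per frame, collapse
    repeats, drop blanks. probs has shape (frames, symbols)."""
    path = np.argmax(probs, axis=1)      # most likely symbol at each frame
    decoded, prev = [], blank
    for s in path:
        if s != prev and s != blank:     # collapse repeats, skip blanks
            decoded.append(int(s))
        prev = s
    return decoded

# Toy per-frame distribution over {blank=0, 1, 2} for six frames:
probs = np.array([[0.1, 0.8, 0.1],   # 1
                  [0.1, 0.8, 0.1],   # 1 (repeat, collapsed)
                  [0.8, 0.1, 0.1],   # blank
                  [0.1, 0.1, 0.8],   # 2
                  [0.1, 0.1, 0.8],   # 2 (repeat, collapsed)
                  [0.1, 0.8, 0.1]])  # 1
labels = best_path_decode(probs)     # -> [1, 2, 1]
```

The blank symbol is what allows the same phoneme to occur twice in a row: a blank between two identical symbols prevents them from being collapsed.<br />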
<br />
== Network Output Layer ==<br />
<br />
Two different network output layers were used; however, most experimental results were reported for a simple softmax probability distribution vector over the set of <math>K =<br />
62</math> symbols, corresponding to the 61 phonemes in the corpus and an additional ''null'' symbol indicating that no phoneme distinct from the previous one was detected. This model is referred to as a Connectionist Temporal Classification (CTC) output function. The other (more complicated) output layer, the RNN transducer described in the previous section, was not rigorously compared with a softmax output and had nearly identical performance.<br />
<br />
== Network Training Procedure ==<br />
<br />
The parameters in all ANNs were determined using stochastic gradient descent with a fixed update step size (learning rate) of <math>10^{-4}</math> and a Nesterov momentum term of 0.9. The initial parameters were drawn uniformly at random from <math>[-0.1,0.1]</math>. The optimization procedure was initially run with data instances from the standard 462-speaker training set of the TIMIT corpus. As a stopping criterion, a secondary development subset of 50 speakers was used, on which the phoneme error rate (PER) was computed in each iteration of the optimization algorithm. The initial training phase for each network was halted once the PER stopped decreasing on this development set; using the parameters at this point as the initial weights, the optimization procedure was then re-run with zero-mean Gaussian noise (<math>\sigma = 0.075</math>) added element-wise to the weights for each training sequence <math>({{\mathbf{x}}}_1,\ldots, {{\mathbf{x}}}_T)</math> as a form of regularization. The second optimization procedure was again halted once the PER stopped decreasing on the development set. Multiple trials of these numerical experiments were not performed, and as such, the variability in performance due to the initial parameter values in the optimization routine is unknown.<br />
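The optimizer and the weight-noise regularizer can be sketched on a toy quadratic objective as below. The learning rate is enlarged here so the toy run converges quickly (the paper uses <math>10^{-4}</math>), and the noise is injected into each gradient evaluation as a stand-in for the per-sequence weight noise described above:<br />

```python
import numpy as np

def sgd_nesterov_step(w, v, grad_fn, lr, mu=0.9):
    """One Nesterov-momentum SGD update; the gradient is evaluated
    at the look-ahead point w + mu * v."""
    v = mu * v - lr * grad_fn(w + mu * v)
    return w + v, v

# Toy run on f(w) = ||w||^2 / 2, whose gradient is w itself, with
# Gaussian noise (sigma = 0.075) perturbing each gradient evaluation.
rng = np.random.default_rng(4)
sigma = 0.075
grad = lambda u: u + rng.normal(scale=sigma, size=u.shape)
w, v = np.full(5, 10.0), np.zeros(5)
for _ in range(2000):
    w, v = sgd_nesterov_step(w, v, grad, lr=1e-2)
```

The iterate settles into a small noise ball around the minimum rather than converging exactly, which is the price of the noise; on a network, that same noise acts as a regularizer that discourages sharp minima.<br />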
<br />
= TIMIT Corpus Experiments &amp; Results =<br />
<br />
== Numerical Experiments ==<br />
<br />
To investigate the performance of the Bidirectional LSTM architecture as a function of depth, numerical experiments were conducted with networks with <math>N \in \{1,2,3,5\}</math> layers and 250 hidden units per layer. These are denoted in the paper by the network names CTC-<math>N</math>L-250H (where <math>N</math> is the layer depth), and are summarized with the number of free model parameters in the table below.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|}<br />
<br />
Additional experiments included: a 1-layer model with 3.8M weights, a 3-layer bidirectional ANN with <math>\tanh</math> activation functions rather than LSTM, and a 3-layer unidirectional LSTM model with 3.8M weights (the same number of free parameters as the bidirectional 3-layer LSTM model). Finally, two experiments were performed with a bidirectional LSTM model with 3 hidden layers of 250 hidden units each and an RNN transducer output function. One of these experiments used uniformly randomly initialized parameters, while the other used the final (hidden) parameter weights from the CTC-3L-250H model as the initial parameter values in the optimization algorithm. The names of these experiments are summarized below, where TRANS and PRETRANS denote the RNN transducer experiments initialized randomly and using (pretrained) parameters from the CTC-3L-250H model, respectively. The suffixes UNI and TANH denote the unidirectional and <math>\tanh</math> networks, respectively.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|-<br />
|PreTrans-3l-250h<br />
|4.3M<br />
|}<br />
<br />
== Results ==<br />
<br />
The percentage phoneme error rates and number of epochs in the SGD optimization procedure for the LSTM experiments on the TIMIT dataset with varying network depth are shown below. The PER can be seen to decrease monotonically, however there is negligible difference between 3 and 5 layers—it is possible that the 0.2% difference is within statistical fluctuations induced by the SGD optimization routine and initial parameter values. Note that the allocation of the epochs into either the initial training without noise or the second optimization routine with Gaussian noise added (or both) is unspecified in the paper.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|82<br />
|23.9%<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|55<br />
|21.0%<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|124<br />
|18.6%<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|150<br />
|18.4%<br />
|}<br />
<br />
The second set of PER results are shown below. The unidirectional LSTM architecture CTC-3L-421H-UNI achieves an error rate that is greater than that of the CTC-3L-250H model by 1 percentage point. No further comparative experiments between unidirectional and bidirectional models were given, however, and the margin of statistical uncertainty is unknown; thus the 1% (absolute) difference may or may not be significant. The TRANS-3L-250H model achieves a nearly identical PER to the CTC softmax model (a 0.3% difference); note, however, that it has 0.5M ''more'' parameters due to the additional prediction network at the output, so the comparison is not entirely fair since it has a greater dimensionality. The pretrained model PRETRANS-3L-250H also has 4.3M parameters and sees the best performance, with a 17.7% error rate. Note that the difference in training of these two RNN transducer models is primarily in their initialization: the PRETRANS model was initialized using the trained weights of the CTC-3L-250H model (for the hidden layers). Thus, this difference in error rate of 0.6% is the direct result of a different starting iterate in the optimization procedure, which must be kept in mind when comparing between models.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|87<br />
|23.0%<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|107<br />
|37.6%<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|115<br />
|19.6%<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|112<br />
|18.3%<br />
|-<br />
|'''PreTrans-3l-250h'''<br />
|'''4.3M'''<br />
|'''144'''<br />
|'''17.7%'''<br />
|}<br />
<br />
= Further works =<br />
The first two authors further developed the method so that it can readily be integrated into word-level language models <ref> Graves, A.; Jaitly, N.; Mohamed, A.-R., “Hybrid speech recognition with Deep Bidirectional LSTM,” [http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6707742]</ref>. They used a hybrid approach in which frame-level acoustic targets were produced by a forced alignment given by a GMM-HMM system. <br />
<br />
= References =<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=graves_et_al.,_Speech_recognition_with_deep_recurrent_neural_networks&diff=26147graves et al., Speech recognition with deep recurrent neural networks2015-11-12T17:14:10Z<p>Rqiao: /* References */</p>
<hr />
<div>= Overview =<br />
<br />
This document is a summary of the paper ''Speech recognition with deep recurrent neural networks'' by A. Graves, A.-R. Mohamed, and G. Hinton, which appeared in the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The first and third authors are Artificial Neural Network (ANN) researchers, while Mohamed works in the field of automatic speech recognition.<br />
<br />
The paper presents the application of bidirectional multilayer Long Short-term Memory (LSTM) ANNs with 1–5 layers to phoneme recognition on the TIMIT acoustic-phonetic corpus, which is the standard benchmark in the field of acoustic recognition, extending the previous work by Mohamed and Hinton on this topic using Deep Belief Networks. The TIMIT corpus contains audio recordings of 6300 sentences spoken by 630 (American) English speakers from 8 regions with distinct dialects, where each recording has accompanying manually labelled transcriptions of the phonemes in the audio clips alongside timestamp information. The empirical classification accuracies reported in the literature before the publication of this paper are shown in the timeline below (note that in this figure, the accuracy metric is 100% - PER, where PER is the phoneme classification error rate).<br />
<br />
The deep LSTM networks presented with 3 or more layers obtain phoneme classification error rates of 19.6% or less, with one model obtaining 17.7%, which was the best result reported in the literature at the time, outperforming the previous record of 20.7% achieved by Mohamed et al. Furthermore, the error rate decreases monotonically with LSTM network depth for 1–5 layers. While the bidirectional LSTM model performs well on the TIMIT corpus, any potential advantage of bidirectional over unidirectional LSTM network models cannot be determined from this paper, since the performance comparison is across different numbers of iterations of the optimization algorithm used to train the models, and multiple trials for statistical validity were not performed.<br />
<br />
<br />
[[File:timit.png | frame | center |Timeline of percentage phoneme recognition accuracy achieved on the core TIMIT corpus, from Lopes and Perdigao, 2011. ]]<br />
<br />
== Motivation ==<br />
Neural networks have been trained for speech recognition problems, however usually in combination with hidden Markov Models. The authors in this paper argue that given the nature of speech is an inherently dynamic process RNN should be the ideal choice for such a problem. There has been attempts to train RNNs for speech recognition <ref>A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Un-segmented Sequence Data with Recurrent Neural Networks,” in ICML, Pittsburgh, USA, 2006.</ref> <ref> A. Graves, Supervised sequence labelling with recurrentneural networks, vol. 385, Springer, 2012.</ref> <ref> A. Graves, “Sequence transduction with recurrent neural networks,” in ICML Representation Learning Work-sop, 2012.</ref> and RNNs with LSTM for recognizing cursive handwriting <ref> A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, “Unconstrained Online Handwriting Recognition with Recurrent Neural Networks,” in NIPS.2008.</ref> but neither has made an impact on the speech recognition. The authors drew inspiration from Convolutional Neural Networks, where multiple layers are stacked on top of each other to combine LSTM and RNNs together.<br />
<br />
However instead of using a conventional RNN which only considers previous contexts, a Bidirectional RNN <ref> M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,”IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.</ref> was used to consider both forward and backward contexts. This is due in part because the authors saw no reason not to exploit future contexts since the speech utterances are transcribed at once. Additionally BRNN has the added benefit of being able to consider the entire forward and context, not just some predefined window of forward and backward contexts.<br />
<br />
[[File:brnn.png|center|600px]]<br />
<br />
= Deep RNN models considered by Graves et al. =<br />
<br />
In this paper, Graves et al. use deep LSTM network models. We briefly review recurrent neural networks, which form the basis of the more complicated LSTM network that has composite <math>\mathcal{H}</math> functions instead of sigmoids and additional parameter vectors associated with the ''state'' of each neuron. Finally, a description of ''bidirectional'' ANNs is given, which is used throughout the numerical experiments.<br />
<br />
== Recurrent Neural Networks ==<br />
<br />
Recall that a standard 1-layer recurrent neural network (RNN) computes the hidden vector sequence <math>{\boldsymbol h} = ({{\mathbf{h}}}_1,\ldots,{{\mathbf{h}}}_T)</math> and output vector sequence <math>{{\boldsymbol {{\mathbf{y}}}}}= ({{\mathbf{y}}}_1,\ldots,{{\mathbf{y}}}_T)</math> from an input vector sequence <math>{{\boldsymbol {{\mathbf{x}}}}}= ({{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_T)</math> through the following equation where the index is from <math>t=1</math> to <math>T</math>:<br />
<br />
<math>{{\mathbf{h}}}_t = \begin{cases}<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{\mathbf{b_{h}}}}}\right) &\quad t = 1\\<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{{\mathbf{W}}}_{h h}}}{{\mathbf{h}}}_{t-1} + {{{\mathbf{b_{h}}}}}\right) &\quad \text{else}<br />
\end{cases}</math><br />
<br />
and<br />
<br />
<math>{{\mathbf{y}}}_t = {{{{\mathbf{W}}}_{h y}}}{{\mathbf{h}}}_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The <math>{{\mathbf{W}}}</math> terms are the parameter matrices with subscripts denoting the layer location (<span>e.g. </span><math>{{{{\mathbf{W}}}_{x h}}}</math> is the input-hidden weight matrix), and the offset <math>b</math> terms are bias vectors with appropriate subscripts (<span>e.g. </span><math>{{{\mathbf{b_{h}}}}}</math> is hidden bias vector). The function <math>{\mathcal{H}}</math> is an elementwise vector function with a range of <math>[0,1]</math> for each component in the hidden layer.<br />
<br />
This paper considers multilayer RNN architectures, with the same hidden layer function used for all <math>N</math> layers. In this model, the hidden vector in the <math>n</math>th layer, <math>{\boldsymbol h}^n</math>, is generated by the rule<br />
<br />
<math>{{\mathbf{h}}}^n_t = {\mathcal{H}}\left({{\mathbf{W}}}_{h^{n-1}h^{n}} {{\mathbf{h}}}^{n-1}_t +<br />
{{\mathbf{W}}}_{h^{n}h^{n}} {{\mathbf{h}}}^n_{t-1} + {{{\mathbf{b_{h}}}}}^n \right),</math><br />
<br />
where <math>{\boldsymbol h}^0 = {{\boldsymbol {{\mathbf{x}}}}}</math>. The final network output vector in the <math>t</math>th step of the output sequence, <math>{{\mathbf{y}}}_t</math>, is<br />
<br />
<math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{h^N y}} {{\mathbf{h}}}^N_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
This is pictured in the figure below for an arbitrary layer and time step.<br />
[[File:rnn_graves.png | frame | center |Fig 1. Schematic of a Recurrent Neural Network at an arbitrary layer and time step. ]]<br />
<br />
== Long Short-term Memory Architecture ==<br />
<br />
Graves et al. consider a Long Short-Term Memory (LSTM) architecture from Gers et al. . This model replaces <math>\mathcal{H}(\cdot)</math> by a composite function that incurs additional parameter matrices, and hence a higher dimensional model. Each neuron in the network (<span>i.e. </span> row of a parameter matrix <math>{{\mathbf{W}}}</math>) has an associated state vector <math>{{\mathbf{c}}}_t</math> at step <math>t</math>, which is a function of the previous <math>{{\mathbf{c}}}_{t-1}</math>, the input <math>{{\mathbf{x}}}_t</math> at step <math>t</math>, and the previous step’s hidden state <math>{{\mathbf{h}}}_{t-1}</math> as<br />
<br />
<math>{{\mathbf{c}}}_t = {{\mathbf{f}}}_t \circ {{\mathbf{c}}}_{t-1} + {{\mathbf{i}}}_t \circ \tanh<br />
\left({{{\mathbf{W}}}_{x {{\mathbf{c}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{c}}}}} {{\mathbf{h}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{c}}}\right)</math><br />
<br />
where <math>\circ</math> denotes the Hadamard product (elementwise vector multiplication), the vector <math>{{\mathbf{i}}}_t</math> denotes the so-called ''input'' vector to the cell that generated by the rule<br />
<br />
<math>{{\mathbf{i}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{i}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{i}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{i}}}\right),</math><br />
<br />
and <math>{{\mathbf{f}}}_t</math> is the ''forget gate'' vector, which is given by<br />
<br />
<math>{{\mathbf{f}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{f}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{f}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{f}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{f}}}\right)</math><br />
<br />
Each <math>{{\mathbf{W}}}</math> matrix and bias vector <math>{{\mathbf{b}}}</math> is a free parameter in the model and must be trained. Since <math>{{\mathbf{f}}}_t</math> multiplies the previous state <math>{{\mathbf{c}}}_{t-1}</math> in a Hadamard product with each element in the range <math>[0,1]</math>, it can be understood to reduce or dampen the effect of <math>{{\mathbf{c}}}_{t-1}</math> relative to the new input <math>{{\mathbf{i}}}_t</math>. The final hidden output state is then<br />
<br />
<math>{{\mathbf{h}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{o}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{o}}}}} {{\mathbf{h}}}_{t-1}<br />
+ {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{o}}}}} {{\mathbf{c}}}_{t} + {{\mathbf{b}}}_{{\mathbf{o}}}\right)\circ \tanh({{\mathbf{c}}}_t)</math><br />
<br />
In all of these equations, <math>\sigma</math> denotes the logistic sigmoid function. Note furthermore that <math>{{\mathbf{i}}}</math>, <math>{{\mathbf{f}}}</math>, <math>{{\mathbf{o}}}</math> and <math>{{\mathbf{c}}}</math> all of the same dimension as the hidden vector <math>h</math>. In addition, the weight matrices from the cell to gate vectors (<span>e.g. </span><math>{{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}</math>) are ''diagonal'', such that each parameter matrix is merely a scaling matrix.<br />
<br />
== Bidirectional RNNs ==<br />
<br />
A bidirectional RNN adds another layer of complexity by computing 2 hidden vectors per layer. Neglecting the <math>n</math> superscripts for the layer index, the ''forward'' hidden vector is determined through the conventional recursion as<br />
<br />
<math>{\overrightarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overrightarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}{\overrightarrow{{{\mathbf{h}}}}}}} {\overrightarrow{{{\mathbf{h}}}}}_{t-1} + {{\mathbf{b_{{\overrightarrow{{{\mathbf{h}}}}}}}}} \right),</math><br />
<br />
while the ''backward'' hidden state is determined recursively from the ''reversed'' sequence <math>({{\mathbf{x}}}_T,\ldots,{{\mathbf{x}}}_1)</math> as<br />
<br />
<math>{\overleftarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overleftarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}{\overleftarrow{{{\mathbf{h}}}}}}} {\overleftarrow{{{\mathbf{h}}}}}_{t+1} + {{\mathbf{b_{{\overleftarrow{{{\mathbf{h}}}}}}}}}\right).</math><br />
<br />
The final output for the single layer state is then an affine transformation of <math>{\overrightarrow{{{\mathbf{h}}}}}_t</math> and <math>{\overleftarrow{{{\mathbf{h}}}}}_t</math> as <math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}y}} {\overrightarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}y}} {\overleftarrow{{{\mathbf{h}}}}}_t<br />
+ {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The combination of LSTM with bidirectional dependence has been previously used by Graves and Schmidhuber, and is extended to multilayer networks in this paper. The motivation for this is to use dependencies on both prior and posterior vectors in the sequence to predict a given output at any time step. In other words, a forward and backward context is used.<br />
<br />
= Network Training for Phoneme Recognition =<br />
<br />
This section describes the phoneme classification experiments performed with the TIMIT corpus. An overview of the input timeseries audio data preprocessing into frequency domain vectors is given, and the optimization techniques are described.<br />
<br />
== Frequency Domain Processing ==<br />
<br />
Recall that for a real, periodic signal <math>{f(t)}</math>, the Fourier transform<br />
<br />
<math>{F(\omega)}= \int_{-\infty}^{\infty}e^{-i\omega t} {f(t)}dt</math><br />
<br />
can be represented for discrete samples <math>{f_0, f_1, \cdots<br />
f_{N-1}}</math> as<br />
<br />
<math>F_k = \sum_{n=0}^{N-1} f_n \cdot e^{-i \frac{2 \pi k n}{N}}</math>,<br />
<br />
where <math>{F_k}</math> are the discrete coefficients of the (amplitude) spectral distribution of the signal <math>f</math> in the frequency domain. This is a particularly powerful representation of audio data, since the modulation produced by the tongue and lips when shifting position while the larynx vibrates induces the frequency changes that make up a phonetic alphabet. A particular example is the spectrogram below. A spectrogram is a heat map representation of a matrix of data in which each pixel has intensity proportional to the magnitude of the matrix entry at that location. In this case, the matrix is a windowed Fourier transform of an audio signal spectrum such that the frequency of the audio signal as a function of time can be observed. This spectrogram shows the frequencies of my voice while singing the first bar of ''Hey Jude''; the bright pixels below 200 Hz show the base note, while the fainter lines at integer multiples of the base notes show resonant harmonics.<br />
<br />
[[File:spect.png | frame | center | Spectrogram of the first bar of Hey Jude, showing the frequency amplitude coefficients changing over time as the intensity of the pixels in the heat map.]]<br />
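A spectrogram of the kind shown above can be produced with a few lines of NumPy. This is a hedged sketch: the window length, hop size, and 440&nbsp;Hz test tone are illustrative choices, not values taken from the paper.<br />

```python
import numpy as np

def spectrogram(signal, window=160, hop=160):
    """Magnitude spectrogram via a sliding DFT; each column holds the
    non-redundant coefficient magnitudes of one window."""
    frames = [signal[i:i + window]
              for i in range(0, len(signal) - window + 1, hop)]
    # rfft returns the non-negative-frequency DFT coefficients
    return np.abs(np.array([np.fft.rfft(f) for f in frames])).T

fs = 16000                                 # sampling rate in Hz
t = np.arange(fs) / fs                     # one second of audio
tone = np.sin(2 * np.pi * 440 * t)         # a 440 Hz test tone
S = spectrogram(tone)                      # (coefficients, frames)
peak_bin = S[:, 0].argmax()                # bin width = fs/window = 100 Hz
```

Plotting <code>S</code> as a heat map yields a spectrogram; for the pure tone above, the energy concentrates in the bin nearest 440 Hz.<br />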
<br />
== Input Vector Format ==<br />
<br />
For each audio waveform in the TIMIT dataset, the Fourier coefficients were computed with a sliding Discrete Fourier Transform (DFT). The window duration used was 10 ms, corresponding to <math>n_s = 160</math> samples per DFT since each waveform in the corpus was digitally registered with a sampling frequency of <math>f_s = 16</math> kHz; each window was then reduced to 40 coefficients at each timestep <math>t</math>, <math>\{c^{[t]}_k\}_{k=1}^{40}</math> (reported in the original paper as Fourier-transform-based filter-bank coefficients). In addition, the first and second time derivatives of the coefficients between adjacent DFT windows were computed (the methodology is unspecified; most likely this was performed with a numerical central-difference technique). Thus, the input vector to the network at step <math>t</math> was the concatenated vector<br />
<br />
<math>{{\mathbf{x}}}_t = [c^{[t]}_1, \frac{d}{dt}c^{[t]}_1,<br />
\frac{d^2}{dt^2}c^{[t]}_1, c_2^{[t]}, \frac{d}{dt}c^{[t]}_2,<br />
\frac{d^2}{dt^2}c^{[t]}_2 \ldots]^T.</math><br />
<br />
Finally, an additional preprocessing step was performed: each input vector was normalized such that the dataset had zero mean and unit variance.<br />
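The construction of the input vectors can be sketched as follows. The use of <code>np.gradient</code> (a central-difference scheme) for the temporal derivatives is an assumption of this sketch, since the paper does not specify the method.<br />

```python
import numpy as np

def make_features(coeffs):
    """coeffs: (T, 40) array of per-frame spectral coefficients.
    Appends first and second temporal derivatives (central differences,
    an assumed method) and normalizes each dimension of the resulting
    (T, 120) matrix to zero mean and unit variance."""
    d1 = np.gradient(coeffs, axis=0)       # first temporal derivative
    d2 = np.gradient(d1, axis=0)           # second temporal derivative
    x = np.concatenate([coeffs, d1, d2], axis=1)
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

feats = make_features(np.random.default_rng(1).standard_normal((100, 40)))
```
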
<br />
== RNN Transducer ==<br />
<br />
When building a speech recognition classifier, it is important to note that the input and output sequences have different lengths (sound frames versus phonemes), and that standard RNN objectives require pre-segmented input data. One approach to solving both problems is to align the output labels to the input sound data, but more often than not an aligned dataset is not available. In this paper, the Connectionist Temporal Classification (CTC) method is used to define a probability distribution over output sequences given an input sequence. This is augmented with an RNN that predicts each phoneme given the previous phonemes, and the two predictions are combined by a feed-forward network. The authors call this approach an RNN transducer. From the combined distribution of the RNN and CTC, a maximum-likelihood decoding for a given input can be computed to find the corresponding output label sequence.<br />
<br />
<math>h(x) = \arg \max_{l \in L^{\leq T}} P(l | x)</math><br />
<br />
Where:<br />
<br />
* <math>h(x)</math>: classifier<br />
* <math>x</math>: input sequence<br />
* <math>l</math>: label<br />
* <math>L</math>: alphabet<br />
* <math>T</math>: maximum sequence length<br />
* <math>P(l | x)</math>: probability distribution of <math>l</math> given <math>x</math><br />
<br />
The value of <math>h(x)</math> cannot be computed directly; it is approximated with methods such as best-path or prefix-search decoding. The authors chose to use a graph-search algorithm called beam search.<br />
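A minimal beam search over a sequence of per-step label distributions illustrates the idea. The real decoder searches the transducer's output lattice, so this simplified sketch is illustrative only.<br />

```python
import numpy as np

def beam_search(log_probs, beam_width=3):
    """Approximate argmax_l P(l | x): log_probs is a (T, K) array of
    per-step log label probabilities (a simplification of the
    transducer's lattice). Keeps the beam_width best prefixes."""
    beams = [((), 0.0)]                    # (prefix, cumulative log score)
    for step in log_probs:
        candidates = [(prefix + (k,), score + step[k])
                      for prefix, score in beams
                      for k in range(len(step))]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]    # prune to the best few prefixes
    return beams[0]

best, score = beam_search(np.log(np.array([[0.7, 0.3], [0.2, 0.8]])))
```
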
<br />
== Network Output Layer ==<br />
<br />
Two different network output layers were used; however, most experimental results were reported for a simple softmax probability distribution vector over the set of <math>K = 62</math> symbols, corresponding to the 61 phonemes in the corpus and an additional ''null'' symbol indicating that no phoneme distinct from the previous one was detected. This model is referred to as a Connectionist Temporal Classification (CTC) output function. The other (more complicated) output layer, the ''RNN transducer'' described above, was not rigorously compared with the softmax output and had nearly identical performance; further details of that method are deferred to the original paper.<br />
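The role of the ''null'' symbol can be illustrated with the standard CTC collapsing rule (merge repeated symbols, then delete nulls). This sketch is a generic illustration of CTC decoding rather than the paper's code.<br />

```python
def ctc_collapse(path, blank=0):
    """Map a framewise CTC path to a label sequence: consecutive
    repeats are merged, then the null (blank) symbol is removed."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:       # new, non-blank symbol
            out.append(s)
        prev = s
    return out

# e.g. path 0,3,3,0,3,5,5,0 collapses to the labels 3,3,5:
labels = ctc_collapse([0, 3, 3, 0, 3, 5, 5, 0])
```

Note that a null between two identical symbols is what allows a repeated phoneme to survive the merge step.<br />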
<br />
== Network Training Procedure ==<br />
<br />
The parameters in all ANNs were determined using Stochastic Gradient Descent (SGD) with a fixed update step size (learning rate) of <math>10^{-4}</math> and a Nesterov momentum term of 0.9. The initial parameters were uniformly randomly drawn from <math>[-0.1,0.1]</math>. The optimization procedure was initially run with data instances from the standard 462-speaker training set of the TIMIT corpus. As a stopping criterion, a secondary testing subset of 50 speakers was used, on which the phoneme error rate (PER) was computed in each iteration of the optimization algorithm. The initial training phase for each network was halted once the PER stopped decreasing on this testing subset; using the parameters at this point as the initial weights, the optimization procedure was then re-run with Gaussian noise of zero mean and <math>\sigma = 0.075</math> added element-wise to the parameters for each input sequence <math>({{\mathbf{x}}}_1,\ldots, {{\mathbf{x}}}_T)</math> as a form of regularization. The second optimization procedure was again halted once the PER stopped decreasing on the testing dataset. Multiple trials of these numerical experiments were not performed; as such, the variability in performance due to the initial parameter values in the optimization routine is unknown.<br />
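A single update of this training scheme can be sketched as follows. The exact point at which the weight noise is injected (here, at the momentum look-ahead point, once per call) is an assumption of this sketch.<br />

```python
import numpy as np

def nesterov_sgd_step(w, v, grad_fn, rng, lr=1e-4, mu=0.9, noise_sigma=0.075):
    """One SGD step with Nesterov momentum. Gaussian weight noise
    (sigma = 0.075, as in the paper's second training phase) is added
    before the gradient evaluation; the injection point is assumed."""
    lookahead = w + mu * v                 # Nesterov look-ahead point
    noisy = lookahead + rng.normal(0.0, noise_sigma, size=w.shape)
    g = grad_fn(noisy)                     # gradient at perturbed weights
    v = mu * v - lr * g                    # momentum-smoothed update
    return w + v, v

# noiseless run on the toy objective f(w) = w^2 (gradient 2w)
rng = np.random.default_rng(0)
w, v = np.array([1.0]), np.array([0.0])
for _ in range(1000):
    w, v = nesterov_sgd_step(w, v, lambda p: 2 * p, rng, noise_sigma=0.0)
```
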
<br />
= TIMIT Corpus Experiments &amp; Results =<br />
<br />
== Numerical Experiments ==<br />
<br />
To investigate the performance of the Bidirectional LSTM architecture as a function of depth, numerical experiments were conducted with networks with <math>N \in \{1,2,3,5\}</math> layers and 250 hidden units per layer. These are denoted in the paper by the network names CTC-<math>N</math>L-250H (where <math>N</math> is the layer depth), and are summarized with the number of free model parameters in the table below.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|}<br />
<br />
Additional experiments included a 1-layer model with 3.8M weights, a 3-layer bidirectional ANN with <math>\tanh</math> activation functions rather than LSTM, and a 3-layer unidirectional LSTM model with 3.8M weights (the same number of free parameters as the bidirectional 3-layer LSTM model). Finally, two experiments were performed with a bidirectional LSTM model with 3 hidden layers of 250 hidden units each and an RNN transducer output function. One of these experiments used uniformly randomly initialized parameters; the other used the final (hidden) parameter weights of the CTC-3L-250H model as the initial parameter values in the optimization algorithm. The names of these experiments are summarized below, where TRANS and PRETRANS denote the RNN transducer experiments initialized randomly and with (pretrained) parameters from the CTC-3L-250H model, respectively. The suffixes UNI and TANH denote the unidirectional and <math>\tanh</math> networks, respectively.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|-<br />
|PreTrans-3l-250h<br />
|4.3M<br />
|}<br />
<br />
== Results ==<br />
<br />
The percentage phoneme error rates and the number of epochs in the SGD optimization procedure for the LSTM experiments on the TIMIT dataset with varying network depth are shown below. The PER can be seen to decrease monotonically with depth; however, there is negligible difference between 3 and 5 layers, and it is possible that the 0.2 percentage point difference is within the statistical fluctuations induced by the SGD optimization routine and the initial parameter values. Note that the allocation of epochs between the initial noise-free training phase and the second optimization run with added Gaussian noise is unspecified in the paper.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|82<br />
|23.9%<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|55<br />
|21.0%<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|124<br />
|18.6%<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|150<br />
|18.4%<br />
|}<br />
<br />
The second set of PER results is shown below. The unidirectional LSTM architecture CTC-3L-421H-UNI achieves an error rate greater than that of the CTC-3L-250H model by 1 percentage point. No further comparative experiments between unidirectional and bidirectional models were given, however, and the margin of statistical uncertainty is unknown; thus the 1 percentage point (absolute) difference may or may not be significant. The TRANS-3L-250H model achieves a nearly identical PER to the CTC softmax model (a 0.3 percentage point difference); note, however, that it has 0.5M ''more'' parameters due to the additional classification network at the output, and is hence not an entirely fair comparison since it has a greater dimensionality. The pretrained model PRETRANS-3L-250H also has 4.3M parameters and sees the best performance with a 17.7% error rate. Note that the difference in training of these two RNN transducer models is primarily in their initialization: the PRETRANS model was initialized using the trained weights of the CTC-3L-250H model (for the hidden layers). Thus, this difference in error rate of 0.6 percentage points is the direct result of different starting iterates in the optimization procedure, which must be kept in mind when comparing between models.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|87<br />
|23.0%<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|107<br />
|37.6%<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|115<br />
|19.6%<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|112<br />
|18.3%<br />
|-<br />
|'''PreTrans-3l-250h'''<br />
|'''4.3M'''<br />
|'''144'''<br />
|'''17.7%'''<br />
|}<br />
<br />
= Further work =<br />
The first two authors developed the method further so that it could readily be integrated into word-level language models <ref> Graves, A.; Jaitly, N.; Mohamed, A.-R., “Hybrid speech recognition with Deep Bidirectional LSTM," [http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6707742]</ref>. They used a hybrid approach in which frame-level acoustic targets were produced by a forced alignment given by a GMM-HMM system. <br />
<br />
= References =<br />
<references /></div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f15Stat946PaperSignUp&diff=25956f15Stat946PaperSignUp2015-11-07T20:00:29Z<p>Rqiao: /* Set B */</p>
<hr />
<div> <br />
=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=<br />
<br />
= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=<br />
<br />
Use the following notations:<br />
<br />
S: You have written a summary on the paper<br />
<br />
T: You had technical contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
E: You had editorial contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]<br />
<br />
<br />
=Set A=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Oct 16 || pascal poupart || || Guest Lecturer||||<br />
|-<br />
|Oct 16 ||pascal poupart || ||Guest Lecturer ||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]<br />
|-<br />
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]<br />
|-<br />
|Makeup Class (TBA) || Peter Blouw|| ||Memory Networks.|| [http://arxiv.org/abs/1410.3916]|| [[Memory Networks|Summary]]<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]<br />
|-<br />
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]<br />
|-<br />
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]<br />
|-<br />
|Nov 13 || Tim Tse || || From Machine Learning to Machine Reasoning ||[http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]<br />
|-<br />
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]<br />
|-<br />
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]<br />
|-<br />
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]||<br />
|-<br />
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]<br />
|-<br />
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | summary]]<br />
|-<br />
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/abs/10.1021/ci500747n.pdf Paper]||<br />
|-<br />
|Nov 27 || Derek Latremouille || ||The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]<br />
|-<br />
|Nov 27 ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models||||<br />
|-<br />
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]<br />
|-<br />
|Dec 4 || Fatemeh Karimi || ||MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||<br />
|-<br />
|Dec 4 || Jan Gosmann || || A fast learning algorithm for deep belief nets || [http://www.mitpressjournals.org/doi/pdf/10.1162/neco.2006.18.7.1527 Paper] || [[A fast learning algorithm for deep belief nets | Summary]]<br />
|-<br />
|Dec 4 || Dylan Drover || || Towards AI-complete question answering: a set of prerequisite toy tasks || [http://arxiv.org/pdf/1502.05698.pdf Paper] ||<br />
|-<br />
|}<br />
|}<br />
<br />
=Set B=<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Anthony Caterini ||15 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]||<br />
|-<br />
|Jan Gosmann || || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]<br />
|-<br />
|Brent Komer || || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] ||<br />
|-<br />
|Sean Aubin || || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]<br />
|-<br />
|Peter Blouw|| || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] ||<br />
|-<br />
|Tim Tse|| || Question answering with subgraph embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] ||<br />
|-<br />
|Rui Qiao|| || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]</div>Rqiaohttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Hierarchical_Features_for_Scene_Labeling&diff=25837learning Hierarchical Features for Scene Labeling2015-11-05T01:08:46Z<p>Rqiao: /* Network Architecture */</p>
<hr />
<div>= Introduction =<br />
<br />
'''Test input''': The input into the network was a static image such as the one below:<br />
<br />
[[File:cows_in_field.png | 500px ]]<br />
<br />
'''Training data and desired result''': The desired result (which is the same format as the training data given to the network for supervised learning) is an image with large features labelled.<br />
<br />
<gallery widths="500px" heights="400px"><br />
Image:labeled_cows.png|Labeled Result<br />
<br />
</gallery><br />
<br />
[[File:cow_legend.png]]<br />
<br />
One of the difficulties in solving this problem is that traditional convolutional neural networks (CNNs) only take a small region around each pixel into account, which is often not sufficient for labeling it correctly, since the correct label is determined by context on a larger scale. To tackle this problem, the authors extend the method of sharing weights between spatial locations, as in traditional CNNs, to sharing weights across multiple scales. This is achieved by generating multiple scaled versions of the input image. Furthermore, the weight sharing across scales leads to the learning of scale-invariant features.<br />
<br />
A multi-scale convolutional network is trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel for scene labeling. Also a technique is proposed to automatically retrieve an optimal set of components that best explain the scene from a pool of segmentation components.<br />
<br />
= Related work =<br />
<br />
There is only one previously published work on using convolutional networks for scene parsing.<ref><br />
Grangier, David, Léon Bottou, and Ronan Collobert. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.183.8571&rep=rep1&type=pdf "Deep convolutional networks for scene parsing."] ICML 2009 Deep Learning Workshop. Vol. 3. 2009.<br />
</ref><br />
While somewhat preliminary, their work showed that convolutional networks fed with raw pixels could be trained to perform scene parsing with decent accuracy.<br />
<br />
= Methodology =<br />
<br />
Below we can see a flow of the overall approach.<br />
<br />
[[File:yann_flow.png | 1200px ]]<br />
<br />
The model proposed by the paper is depicted above. In the first representation, an image patch is seen as a point in <math>\mathbb R^P</math>, and we seek a transform <math>f:\mathbb R^P \to \mathbb R^Q</math> that maps each patch into <math>\mathbb R^Q</math>, a space where it can be classified linearly. This first representation usually suffers from two main problems with traditional convolutional neural networks: (1) the window considered rarely contains an object that is centred and scaled, and (2) integrating a large context requires increasing the grid size, and therefore the dimensionality <math>P</math>; it is then necessary to enforce some invariance in the function <math>f</math> itself. This is usually achieved through pooling, but pooling degrades the model's ability to precisely locate and delineate objects. In this paper, <math>f</math> is implemented by a multiscale convolutional neural network, which allows integrating large contexts into local decisions while remaining manageable in terms of parameters/dimensionality.<br />
<br />
In the second representation, the image is seen as an edge-weighted graph, on which one or several oversegmentations can be constructed. The resulting components are spatially accurate and naturally delineate objects, as this representation preserves pixel-level precision.<br />
<br />
== Pre-processing ==<br />
<br />
Before being put into the Convolutional Neural Network (CNN), multiple scaled versions of the image are generated. The set of these scaled images is called a ''pyramid''. Three differently scaled versions of the image were created, in a manner similar to that shown in the picture below.<br />
<br />
[[File:Image_pyramid.png]]<br />
<br />
The scaling can be done with different transforms; the paper uses the Laplacian transform. The Laplacian is the sum of partial second derivatives <math>\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}</math>. A two-dimensional discrete approximation is given by the matrix <math>\left[\begin{array}{ccc}0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0\end{array}\right]</math>.<br />
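As a rough illustration, a Laplacian pyramid can be built by repeatedly blurring and downsampling the image, then storing the band-pass difference between each level and the upsampled version of the next coarser level. The sketch below (function names and parameters are illustrative, not from the paper) uses NumPy and SciPy:<br />

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def laplacian_pyramid(img, levels=3):
    """Build a Laplacian pyramid: each level stores the difference between
    an image and the upsampled blur of its coarser version (band-pass)."""
    pyramid = []
    current = img.astype(float)
    for _ in range(levels - 1):
        blurred = gaussian_filter(current, sigma=1.0)
        down = blurred[::2, ::2]                       # next coarser scale
        up = zoom(down, 2, order=1)[:current.shape[0], :current.shape[1]]
        pyramid.append(current - up)                   # band-pass residual
        current = down
    pyramid.append(current)                            # low-pass residue
    return pyramid
```

Each pyramid level then serves as one input scale <math>X_s</math> for the network.<br />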
<br />
== Network Architecture ==<br />
<br />
The proposed scene parsing architecture has two main components: Multi-scale convolutional representation and Graph-based classification.<br />
<br />
In the first representation, for each scale of the Laplacian pyramid, a typical 3-stage CNN architecture was used (each of the first 2 stages is composed of three layers: convolution of a kernel with the feature map, a non-linearity, and pooling). The function tanh served as the non-linearity. The kernels used were 7x7 Toeplitz matrices (matrices with constant values along their diagonals). The pooling operation was performed by the 2x2 max-pool operator. The same CNN was applied to all the differently sized images. Since the parameters were shared between the networks, the ''same'' connection weights were applied to all of the images, allowing the detection of scale-invariant features. The outputs of the CNNs at each scale are upsampled and concatenated to produce a map of feature vectors. The authors believe that the more scales used to jointly train the models, the better the representation becomes for all scales.<br />
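One stage of the per-scale network (convolution, tanh non-linearity, 2x2 max pooling) can be sketched as follows for a single input and output channel; the kernel here is a stand-in for a learned filter, not the paper's trained weights:<br />

```python
import numpy as np
from scipy.signal import correlate2d

def cnn_stage(feature_map, kernel):
    """One stage of the per-scale network: convolution with a learned
    kernel, tanh non-linearity, then 2x2 max pooling."""
    conv = correlate2d(feature_map, kernel, mode='valid')
    act = np.tanh(conv)
    h, w = act.shape
    h, w = h - h % 2, w - w % 2          # trim to even size for 2x2 pooling
    pooled = act[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
    return pooled
```

With a 7x7 kernel, a 16x16 input produces a 10x10 "valid" convolution, which max-pooling reduces to 5x5.<br />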
<br />
In the second representation, the image is seen as an edge-weighted graph<ref><br />
Shotton, Jamie, et al.[http://www.csd.uwo.ca/~olga/Courses/Fall2013/CS9840/PossibleStudentPapers/eccv06.pdf "Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation." ]Computer Vision–ECCV 2006. Springer Berlin Heidelberg, 2006. 1-15.<br />
</ref><ref><br />
Fulkerson, Brian, Andrea Vedaldi, and Stefano Soatto. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.150.4613&rep=rep1&type=pdf "Class segmentation and object localization with superpixel neighborhoods."] Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.<br />
</ref>, on which one or several over-segmentations can be constructed and used to group the feature descriptors. This graph segmentation technique was taken from another paper<ref><br />
Felzenszwalb, Pedro F., and Daniel P. Huttenlocher.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.150.4613&rep=rep1&type=pdf "Efficient graph-based image segmentation."] International Journal of Computer Vision 59.2 (2004): 167-181.<br />
</ref>. Three techniques are proposed to produce the final image labelling as discussed below in the Post-Processing section.<br />
<br />
Stochastic gradient descent was used for training the filters. To avoid over-fitting, the training images were augmented via jitter, horizontal flipping, rotations between -8 and +8 degrees, and rescaling between 90% and 110%. The objective function was the ''cross entropy'' loss, [https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/ which takes into account how close a prediction is to the target, not merely whether it is correct].<br />
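The augmentations above can be sketched as follows (a hedged illustration of the stated transformations; the paper does not specify the exact implementation):<br />

```python
import numpy as np
from scipy.ndimage import rotate, zoom

rng = np.random.default_rng(0)

def augment(img):
    """Randomly jitter a training image: horizontal flip, rotation in
    [-8, +8] degrees, and rescaling in [90%, 110%], as described above."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                              # horizontal flip
    angle = rng.uniform(-8, 8)
    img = rotate(img, angle, reshape=False, mode='nearest')
    scale = rng.uniform(0.9, 1.1)
    return zoom(img, scale, order=1)
```

Each training example is perturbed independently, so the network rarely sees the exact same image twice.<br />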
<br />
== Post-Processing ==<br />
<br />
Unlike previous approaches, this scene-labelling method emphasizes a highly accurate pixel labelling system. Although a variety of post-processing approaches were attempted, including superpixels, Conditional Random Fields and gPb, the simple superpixel approach yielded state-of-the-art results.<br />
<br />
Superpixels are small over-segmented regions of visually similar pixels. To label them, a two-layer neural network was used: given the feature vectors from the CNN as input, the features were averaged across each superpixel. The picture below shows the general approach.<br />
<br />
[[File:super_pix.png]]<br />
<br />
=== Conditional Random Fields ===<br />
<br />
A standard approach to labelling is training a CRF model on the superpixels. It consists of associating the image with a graph and defining an energy function whose optimal solution corresponds to the desired<br />
segmentation. The Conditional Random Field (CRF) energy function is typically composed of a unary term enforcing the label variable <math>l</math> to take values close to the predictions <math>\hat{d}</math>, and a pairwise term enforcing regularity or local consistency of <math>l</math>. The CRF energy to minimize is given by<br />
<br />
[[File:Paper1p1.png ]]<br />
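A minimal sketch of an energy of this form, assuming a simple Potts pairwise term (the paper's pairwise term additionally depends on local image gradients), is:<br />

```python
import numpy as np

def crf_energy(labels, unary, edges, gamma=1.0):
    """Evaluate a simple CRF energy: the unary cost of each node's label
    plus a Potts penalty gamma for every pair of neighbouring nodes
    whose labels disagree."""
    u = sum(unary[i, labels[i]] for i in range(len(labels)))   # unary term
    p = sum(gamma for (i, j) in edges if labels[i] != labels[j])  # pairwise
    return u + p
```

Minimizing this energy trades off fidelity to the per-node predictions against spatial smoothness of the labelling.<br />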
<br />
The entire process of using CRF can be summarized below<br />
<br />
[[File:Paper2p2.png | 1200px ]]<br />
<br />
= Model =<br />
<br />
'''Scale-invariant, Scene-level feature extraction'''<br />
<br />
Given an input image, a multiscale pyramid of images <math>\ X_s </math>, where <math>s</math> belongs to {1,...,N}, is constructed. The multiscale pyramid is typically pre-processed so that local neighborhoods have zero mean and unit standard deviation. We denote by <math>f_s</math> a classical convolutional network with parameters <math>\theta_s</math>; the parameters are shared across all scales <math>s</math>. <br />
<br />
For a network <math>f_s</math> with L layers, we have the regular convolutional network:<br />
<br />
<math>\ f_s(X_s; \theta_s)=W_LH_{L-1}</math>.<br />
<br />
<math>\ H_L </math> is the vector of hidden units at layer L, where:<br />
<br />
<math>\ H_l=pool(tanh(W_lH_{l-1}+b_l))</math>, where <math> b_l </math> is a vector of bias parameters<br />
<br />
Finally, the output of N networks are upsampled and concatenated so as to produce F:<br />
<br />
<math>\ F= [f_1, u(f_2), ... , u(f_N)]</math>, where <math> u</math> is an upsampling function. <br />
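The upsampling-and-concatenation step producing <math>F</math> can be sketched as follows, assuming each per-scale feature map is stored as an (H, W, C) array and the scales are related by integer factors:<br />

```python
import numpy as np

def concat_multiscale(features):
    """Upsample each per-scale feature map to the finest resolution
    (nearest-neighbour, by integer factors) and concatenate them along
    the channel axis to form the combined feature map F."""
    target_h, target_w = features[0].shape[:2]
    upsampled = []
    for f in features:
        ry, rx = target_h // f.shape[0], target_w // f.shape[1]
        upsampled.append(np.repeat(np.repeat(f, ry, axis=0), rx, axis=1))
    return np.concatenate(upsampled, axis=2)
```

Every pixel of the finest scale thus receives a feature vector that stacks evidence from all N scales.<br />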
<br />
''' Classification '''<br />
<br />
Having <math>\ F</math>, we now want to classify the superpixels.<br />
<br />
<math>\ y_i= W_2tanh(W_1F_i+b_1)</math>, <br />
<br />
<math>\ W_1</math> and <math>\ W_2</math> are trainable parameters of the classifier. <br />
<br />
<math>\ \hat{d_{i,a}}=\frac{e^{y_{i,a}}}{\sum_{b\in classes}{e^{y_{i,b}}}}</math>, <br />
<br />
<math> \hat{d_{i,a}}</math> is the predicted class distribution from the linear classifier for pixel <math>i</math> and class <math>a</math>.<br />
<br />
<math>\ \hat{d_{k,a}}= \frac{1}{s(k)}\sum_{i\in k}{\hat{d_{i,a}}}</math>,<br />
<br />
where <math>\hat{d_k}</math> is the average pixelwise distribution over superpixel <math>k</math> and <math> s(k)</math> is the surface area (number of pixels) of component <math>k</math>. <br />
<br />
In this case, the final labeling for each component <math>k</math> is given by:<br />
<br />
<math>\ l_k=argmax_{a\in classes}{\hat{d_{k,a}}}</math><br />
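The classification pipeline above (two-layer network, softmax, superpixel averaging, argmax) can be sketched as follows; the weight matrices here are stand-ins for trained parameters:<br />

```python
import numpy as np

def label_superpixels(F, superpixels, W1, b1, W2):
    """Classify each pixel's feature vector with a two-layer net and a
    softmax, average the class distributions over each superpixel, then
    take the argmax as that component's label."""
    H, W, Q = F.shape
    hidden = np.tanh(F.reshape(-1, Q) @ W1.T + b1)     # y_i = W2 tanh(W1 F_i + b1)
    y = hidden @ W2.T
    e = np.exp(y - y.max(axis=1, keepdims=True))       # stable softmax
    d_hat = e / e.sum(axis=1, keepdims=True)           # d_hat_{i,a}
    sp = superpixels.reshape(-1)
    labels = {}
    for k in np.unique(sp):
        d_k = d_hat[sp == k].mean(axis=0)              # average over component k
        labels[k] = int(np.argmax(d_k))                # l_k = argmax_a d_hat_{k,a}
    return labels
```

The mean over a boolean mask implements the <math>\frac{1}{s(k)}\sum_{i\in k}</math> averaging directly.<br />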
<br />
<br />
= Results =<br />
<br />
The network was tested on the Stanford Background, SIFT Flow and Barcelona datasets.<br />
<br />
Results on the Stanford Background dataset show that superpixels achieve state-of-the-art accuracy with minimal processing time.<br />
<br />
[[File:stanford_res.png]]<br />
<br />
Since superpixels were shown to be so effective on the Stanford dataset, they were the only segmentation method used for the SIFT Flow and Barcelona datasets. Instead, different exposures of the training data to the network (balanced class frequencies, denoted by superscript 1, or natural frequencies, denoted by superscript 2) were explored, in conjunction with the aforementioned graph-based segmentation method combined with the optimal cover algorithm.<br />
<br />
From the SIFT Flow dataset, it can be seen that graph-based segmentation with the optimal cover method offers a significant advantage.<br />
<br />
[[File:sift_res.png]]<br />
<br />
In the Barcelona dataset, it can be seen that a dataset with many labels is too difficult for the CNN.<br />
<br />
[[File:barcelona_res.png]]<br />
<br />
= Conclusions =<br />
<br />
A wide window for contextual information, achieved through the multiscale network, greatly improves the results and diminishes the role of the post-processing stage. This allows the computationally expensive post-processing to be replaced with a simpler and faster method (e.g., majority vote), increasing efficiency without a meaningful loss in classification accuracy. The paper has demonstrated that a feed-forward convolutional network, trained end-to-end and fed with raw pixels, can produce state-of-the-art performance on scene parsing datasets. The model does not rely on engineered features and uses purely supervised training from fully-labeled images.<br />
<br />
An interesting finding of this paper is that even in the absence of any post-processing, simply labelling each pixel with the highest-scoring category produced by the convolutional net for that location yields near state-of-the-art pixel-wise accuracy.<br />
<br />
= Future Work =<br />
<br />
Aside from the usual advances in CNN architecture, such as unsupervised pre-training, rectifying non-linearities and local contrast normalization, there would be a significant benefit, especially in datasets with many labels, to a semantic understanding of the labels: for example, understanding that a window is often part of a building or a car.<br />
<br />
There would also be considerable benefit from improving the metrics used in scene parsing. The current pixel-wise accuracy is a somewhat uninformative measure of the quality of the result: spotting rare objects is often more important than correctly labeling every boundary pixel of a large region such as the sky. The average per-class accuracy is a step in the right direction, but the authors would prefer a system that correctly spots every object or region while giving only approximate boundaries, over a system that produces accurate boundaries for large regions (sky, road, grass, etc.) but fails to spot small objects.<br />
<br />
=References=<br />
<references /></div>Rqiao