http://wiki.math.uwaterloo.ca/statwiki/index.php?title=meProp:_Sparsified_Back_Propagation_for_Accelerated_Deep_Learning_with_Reduced_Overfitting&feed=atom&action=historymeProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting - Revision history2024-03-28T19:26:16ZRevision history for this page on the wikiMediaWiki 1.41.0http://wiki.math.uwaterloo.ca/statwiki/index.php?title=meProp:_Sparsified_Back_Propagation_for_Accelerated_Deep_Learning_with_Reduced_Overfitting&diff=31684&oldid=prevJimit: /* Critiques */2017-11-29T00:40:23Z<p><span dir="auto"><span class="autocomment">Critiques</span></span></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 20:40, 28 November 2017</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l163">Line 163:</td>
<td colspan="2" class="diff-lineno">Line 163:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The authors do not discuss what is lost, or whether the loss is significant, when meProp is used on tasks where preserving temporal information and contextual data is important. For example, when RNNs are used for question answering, details from earlier regions of the paragraph could be garbled in memory because the weights outside the top-k set are not updated during backpropagation. Indeed, the lack of principled methods for sparsification is a major issue in this case, since tasks such as machine translation often involve data in which certain parts of an input are systematically much more predictive than others. There could be a trade-off between knowledge preservation and the choice of the hyperparameter k, which could be checked by further analysis such as correlation/covariance studies.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The authors do not discuss what is lost, or whether the loss is significant, when meProp is used on tasks where preserving temporal information and contextual data is important. For example, when RNNs are used for question answering, details from earlier regions of the paragraph could be garbled in memory because the weights outside the top-k set are not updated during backpropagation. 
Indeed, the lack of principled methods for sparsification is a major issue in this case, since tasks such as machine translation often involve data in which certain parts of an input are systematically much more predictive than others. There could be a trade-off between knowledge preservation and the choice of the hyperparameter k, which could be checked by further analysis such as correlation/covariance studies.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The approach can be thought of as a deterministic form of dropout that gives priority to the connections contributing the largest gradients during backpropagation. However, unlike dropout (which is random in nature), top-k selection may permanently exclude some parts of the network from training, a risk the paper does not mention. The authors also did not test their approach on larger datasets such as ImageNet, so it is possible that the dataset they used (MNIST) is too simple for the given architecture and that meProp generalized better only for that reason. Relying on MNIST results in research from 2017 is generally poor practice, as they shed no light on real-world AI problems. The idea itself is simple: apply only the k strongest gradients during backpropagation, which should also work for other architectures (LSTMs, RNNs). The paper shows the advantage of the method empirically, but only on a simple dataset; results on a more complex, real-world dataset are missing.</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The approach can be thought of as a deterministic form of dropout that gives priority to the connections contributing the largest gradients during backpropagation. However, unlike dropout (which is random in nature), top-k selection may permanently exclude some parts of the network from training, a risk the paper does not mention. 
The authors also did not test their approach on larger datasets such as ImageNet, so it is possible that the dataset they used (MNIST) is too simple for the given architecture and that meProp generalized better only for that reason. Relying on MNIST results in research from 2017 is generally poor practice, as they shed no light on real-world AI problems. The idea itself is simple: apply only the k strongest gradients during backpropagation, which should also work for other architectures (LSTMs, RNNs). The paper shows the advantage of the method empirically, but only on a simple dataset; results on a more complex, real-world dataset are missing<ins style="font-weight: bold; text-decoration: none;">. Lastly, the approximate gradient introduced here can be interpreted as a projection of the actual gradient onto a lower-dimensional subspace. This observation suggests that the method might be connected to the projected gradient optimization algorithm</ins>.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>=References=</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>=References=</div></td></tr>
</table>Jimithttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=meProp:_Sparsified_Back_Propagation_for_Accelerated_Deep_Learning_with_Reduced_Overfitting&diff=31539&oldid=prevRuifanyu: /* References */2017-11-27T15:55:05Z<p><span dir="auto"><span class="autocomment">References</span></span></p>
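The Critiques revision above describes meProp as keeping only the k strongest gradients during backpropagation and reads the result as a projection of the full gradient. The following is a minimal sketch of that selection on a single gradient vector, not the authors' implementation: the function name `top_k_sparsify`, the use of plain Python lists, and the tie-breaking rule (lower index wins) are all assumptions made for illustration.

```python
def top_k_sparsify(grad, k):
    """Keep the k entries of `grad` with the largest magnitude; zero the rest."""
    if not 0 <= k <= len(grad):
        raise ValueError("k must be between 0 and len(grad)")
    # Indices of the k largest |g_i|; the stable sort breaks ties by lower index.
    keep = set(sorted(range(len(grad)), key=lambda i: -abs(grad[i]))[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


grad = [0.1, -2.0, 0.03, 1.5, -0.4]
sparse = top_k_sparsify(grad, k=2)
print(sparse)  # [0.0, -2.0, 0.0, 1.5, 0.0]

# Re-applying the selection changes nothing, as expected of a projection.
assert top_k_sparsify(sparse, k=2) == sparse
```

Note that the retained coordinates depend on the gradient itself, so the subspace changes from step to step; this is one reason the link to projected gradient methods is only suggestive. The sketch also makes the critique concrete: an index that never ranks in the top k receives a zero update on every step.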
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 11:55, 27 November 2017</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l171">Line 171:</td>
<td colspan="2" class="diff-lineno">Line 171:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research 15 (2014) 1929-1958</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research 15 (2014) 1929-1958</div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>#Speech and Language Processing. Daniel Jurafsky & James H. Martin. 2017. Draft of August 28, 2017.</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div># Speech and Language Processing. Daniel Jurafsky & James H. Martin. 2017. Draft of August 28, 2017.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Devos. M. R., Orban. G. A. "Self adaptive backpropagation." Proceedings NeuroNimes 1988.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Devos. M. R., Orban. G. A. "Self adaptive backpropagation." Proceedings NeuroNimes 1988.</div></td></tr>
</table>Ruifanyuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=meProp:_Sparsified_Back_Propagation_for_Accelerated_Deep_Learning_with_Reduced_Overfitting&diff=29635&oldid=prevMlhonke: /* Related Work */2017-11-07T17:22:08Z<p><span dir="auto"><span class="autocomment">Related Work</span></span></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 13:22, 7 November 2017</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l29">Line 29:</td>
<td colspan="2" class="diff-lineno">Line 29:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Some of the notable related work to this paper are as follows:</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Some of the notable related work to this paper are as follows:</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>In 1990, Tollenaere et al.[1] proposed SuperSAB: an adaptive acceleration strategy for error back propagation learning. It is an improvement on the SAB (self-adapting back propagation) strategy. SuperSAB avoids taking a step when a change of sign in the weight derivative is discovered. Instead, it decreases the step size until a safe step is discovered (one without a sign change of the weight derivative). They proved that it may converge orders of magnitude faster than the original back propagation algorithm, and is only slightly unstable. In addition, the algorithm is very insensitive to the choice of parameter values and has excellent scaling properties.</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>In 1990, Tollenaere et al.[1] proposed SuperSAB: an adaptive acceleration strategy for error back propagation learning. It is an improvement on the SAB (self-adapting back propagation) strategy <ins style="font-weight: bold; text-decoration: none;">[7]</ins>. SuperSAB avoids taking a step when a change of sign in the weight derivative is discovered. Instead, it decreases the step size until a safe step is discovered (one without a sign change of the weight derivative). They proved that it may converge orders of magnitude faster than the original back propagation algorithm, and is only slightly unstable. In addition, the algorithm is very insensitive to the choice of parameter values and has excellent scaling properties.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In 1993, Riedmiller and Braun[2] proposed an algorithm called RPROP. To overcome the inherent disadvantages of pure gradient descent, it performs a local adaptation of the weight updates according to the behavior of the error function. To be more specific, the authors defined an individual update-value for each weight. When the update of a weight is too large (mathematically, when the partial derivative of the error function with respect to that weight changes its sign), the update-value of this weight decreases. Otherwise, it increases. After the update-values have been adapted, the weight update is carried out as follows: when the partial derivative of the error function with respect to a specific weight is positive, the original weight decreases by its corresponding update-value, otherwise it increases.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In 1993, Riedmiller and Braun[2] proposed an algorithm called RPROP. To overcome the inherent disadvantages of pure gradient descent, it performs a local adaptation of the weight updates according to the behavior of the error function. To be more specific, the authors defined an individual update-value for each weight. When the update of a weight is too large (mathematically, when the partial derivative of the error function with respect to that weight changes its sign), the update-value of this weight decreases. Otherwise, it increases. 
After the update-values have been adapted, the weight update is carried out as follows: when the partial derivative of the error function with respect to a specific weight is positive, the original weight decreases by its corresponding update-value, otherwise it increases.</div></td></tr>
</table>Mlhonkehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=meProp:_Sparsified_Back_Propagation_for_Accelerated_Deep_Learning_with_Reduced_Overfitting&diff=29634&oldid=prevMlhonke: /* References */2017-11-07T17:21:43Z<p><span dir="auto"><span class="autocomment">References</span></span></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 13:21, 7 November 2017</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l172">Line 172:</td>
<td colspan="2" class="diff-lineno">Line 172:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research 15 (2014) 1929-1958</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research 15 (2014) 1929-1958</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>#Speech and Language Processing. Daniel Jurafsky & James H. Martin. 2017. Draft of August 28, 2017.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>#Speech and Language Processing. Daniel Jurafsky & James H. Martin. 2017. Draft of August 28, 2017.</div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"># Devos. M. R., Orban. G. A. "Self adaptive backpropagation." Proceedings NeuroNimes 1988.</ins></div></td></tr>
</table>Mlhonkehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=meProp:_Sparsified_Back_Propagation_for_Accelerated_Deep_Learning_with_Reduced_Overfitting&diff=29630&oldid=prevMlhonke: /* Related Work */ Adding content to SuperSAB2017-11-07T17:19:32Z<p><span dir="auto"><span class="autocomment">Related Work: </span> Adding content to SuperSAB</span></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 13:19, 7 November 2017</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l29">Line 29:</td>
<td colspan="2" class="diff-lineno">Line 29:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Some of the notable related work to this paper are as follows:</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Some of the notable related work to this paper are as follows:</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>In 1990, Tollenaere et al.[1] proposed SuperSAB: an adaptive acceleration strategy for error back propagation learning. They proved that it may converge orders of magnitude faster than the original back propagation algorithm, and is only slightly unstable. In addition, the algorithm is very insensitive to the choice of parameter values and has excellent scaling properties.</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>In 1990, Tollenaere et al.[1] proposed SuperSAB: an adaptive acceleration strategy for error back propagation learning<ins style="font-weight: bold; text-decoration: none;">. It is an improvement on the SAB (self-adapting back propagation) strategy. SuperSAB avoids taking a step when a change of sign in the weight derivative is discovered. Instead, it decreases the step size until a safe step is discovered (one without a sign change of the weight derivative)</ins>. They proved that it may converge orders of magnitude faster than the original back propagation algorithm, and is only slightly unstable. In addition, the algorithm is very insensitive to the choice of parameter values and has excellent scaling properties.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In 1993, Riedmiller and Braun[2] proposed an algorithm called RPROP. To overcome the inherent disadvantages of pure gradient descent, it performs a local adaptation of the weight updates according to the behavior of the error function. To be more specific, the authors defined an individual update-value for each weight. When the update of a weight is too large (mathematically, when the partial derivative of the error function with respect to that weight changes its sign), the update-value of this weight decreases. Otherwise, it increases. After the update-values have been adapted, the weight update is carried out as follows: when the partial derivative of the error function with respect to a specific weight is positive, the original weight decreases by its corresponding update-value, otherwise it increases.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In 1993, Riedmiller and Braun[2] proposed an algorithm called RPROP. To overcome the inherent disadvantages of pure gradient descent, it performs a local adaptation of the weight updates according to the behavior of the error function. To be more specific, the authors defined an individual update-value for each weight. When the update of a weight is too large (mathematically, when the partial derivative of the error function with respect to that weight changes its sign), the update-value of this weight decreases. Otherwise, it increases. 
After the update-values have been adapted, the weight update is carried out as follows: when the partial derivative of the error function with respect to a specific weight is positive, the original weight decreases by its corresponding update-value, otherwise it increases.</div></td></tr>
</table>Mlhonkehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=meProp:_Sparsified_Back_Propagation_for_Accelerated_Deep_Learning_with_Reduced_Overfitting&diff=29623&oldid=prevMlhonke: /* Related Work */ Increase previously didn't have subject (it was implied, but not proper sentence structure).2017-11-07T17:10:52Z<p><span dir="auto"><span class="autocomment">Related Work: </span> Increase previously didn't have subject (it was implied, but not proper sentence structure).</span></p>
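The RPROP description in these Related Work revisions (a per-weight update-value that shrinks on a sign change of the derivative and grows otherwise, followed by a step against the sign of the derivative) can be sketched as below. This is a simplified illustration rather than the published algorithm: it omits the weight-backtracking step of the original method, leaves the update-value unchanged when either derivative is zero, and the function name `rprop_step`, the factors `eta_plus`/`eta_minus`, and the bounds `delta_min`/`delta_max` are assumptions that do not come from the text above.

```python
def rprop_step(weights, grads, prev_grads, deltas,
               eta_plus=1.2, eta_minus=0.5, delta_min=1e-6, delta_max=50.0):
    """One simplified RPROP update; returns (new_weights, new_deltas)."""
    new_weights, new_deltas = [], []
    for w, g, g_prev, d in zip(weights, grads, prev_grads, deltas):
        if g * g_prev < 0:
            # Derivative changed sign: the last step was too large, so shrink it.
            d = max(d * eta_minus, delta_min)
        elif g * g_prev > 0:
            # Same sign as before: grow the update-value to speed things up.
            d = min(d * eta_plus, delta_max)
        # Only the sign of the derivative decides the direction of the step.
        if g > 0:
            w -= d
        elif g < 0:
            w += d
        new_weights.append(w)
        new_deltas.append(d)
    return new_weights, new_deltas


# Toy problem: minimise f(w) = (w - 3)^2, whose derivative is 2 * (w - 3).
w, delta, prev_g = [0.0], [0.1], [0.0]
for _ in range(100):
    g = [2.0 * (w[0] - 3.0)]
    w, delta = rprop_step(w, g, prev_g, delta)
    prev_g = g
print(w[0])  # expected to be close to 3
```

Because the magnitude of the derivative is never used, the step length is governed entirely by the update-values, which is the "local adaptation" the paragraph refers to.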
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 13:10, 7 November 2017</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l31">Line 31:</td>
<td colspan="2" class="diff-lineno">Line 31:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In 1990, Tollenaere et al.[1] proposed SuperSAB: an adaptive acceleration strategy for error back propagation learning. They proved that it may converge orders of magnitude faster than the original back propagation algorithm, and is only slightly unstable. In addition, the algorithm is very insensitive to the choice of parameter values and has excellent scaling properties.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In 1990, Tollenaere et al.[1] proposed SuperSAB: an adaptive acceleration strategy for error back propagation learning. They proved that it may converge orders of magnitude faster than the original back propagation algorithm, and is only slightly unstable. In addition, the algorithm is very insensitive to the choice of parameter values and has excellent scaling properties.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>In 1993, Riedmiller and Braun[2] proposed an algorithm called RPROP. To overcome the inherent disadvantages of pure gradient descent, it performs a local adaptation of the weight updates according to the behavior of the error function. To be more specific, the authors defined an individual update-value for each weight. When the update of a weight is too large (mathematically, when the partial derivative of the error function with respect to that weight changes its sign), the update-value of this weight decreases. Otherwise, it increases. After the update-values have been adapted, the weight update is carried out as follows: when the partial derivative of the error function with respect to a specific weight is positive, the original weight decreases by its corresponding update-value, <del style="font-weight: bold; text-decoration: none;">increase </del>otherwise. </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>In 1993, Riedmiller and Braun[2] proposed an algorithm called RPROP. To overcome the inherent disadvantages of pure gradient descent, it performs a local adaptation of the weight updates according to the behavior of the error function. To be more specific, the authors defined an individual update-value for each weight. When the update of a weight is too large (mathematically, when the partial derivative of the error function with respect to that weight changes its sign), the update-value of this weight decreases. Otherwise, it increases. 
After the update-values have been adapted, the weight update is carried out as follows: when the partial derivative of the error function with respect to a specific weight is positive, the original weight decreases by its corresponding update-value, otherwise <ins style="font-weight: bold; text-decoration: none;">it increases</ins>.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In 2014, Srivastava et al. [3] proposed dropout. Large networks are slow to use, which makes it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. The key idea of dropout is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In 2014, Srivastava et al. [3] proposed dropout. Large networks are slow to use, which makes it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. The key idea of dropout is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much.</div></td></tr>
</table>Mlhonkehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=meProp:_Sparsified_Back_Propagation_for_Accelerated_Deep_Learning_with_Reduced_Overfitting&diff=29589&oldid=prevC3zhi: /* Introduction */2017-11-07T15:58:54Z<p><span dir="auto"><span class="autocomment">Introduction</span></span></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 11:58, 7 November 2017</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l3">Line 3:</td>
<td colspan="2" class="diff-lineno">Line 3:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The backpropagation step in neural network training entails high computational cost since each iteration requires calculation of full gradient vectors and matrices and subsequent update of all model parameters.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The backpropagation step in neural network training entails high computational cost since each iteration requires calculation of full gradient vectors and matrices and subsequent update of all model parameters.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The main idea of the paper is to find only a small but critical subset of the gradient information and, in each learning step, update only this minimal subset of the parameters. This leads to sparsified gradients because only highly relevant parameters are updated and the rest remain untouched.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The main idea of the paper is to find only a small but critical subset of the gradient information and, in each learning step, update only this minimal subset of the parameters. This leads to sparsified gradients because only highly relevant parameters are updated and the rest remain untouched.</div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">[[File:20.png|right|650px]] </ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The paper introduces a simple and effective technique for neural network learning. The main technique entails a modification to the vanilla backpropagation algorithm. The idea is that after a forward pass has been carried out in the usual fashion, we retain only a subset of the full gradient for computation of model parameters. More precisely, a simple quantization technique is employed to sparsify the gradient vectors, viz., the entries of the first gradient in a backpropagation step are set to zero unless they reach a specified size threshold. The rest of the gradients (the ones with respect to the weights and biases of the neural network) are computed using the chain rule in the typical way using the sparsified gradient obtained from the top layer. Since only a small subset of the weight matrix is modified, we obtain a linear reduction in the computational cost. The experimental results presented in the paper suggest that accuracy is improved rather than being degraded. The name given to the proposed technique is minimal effort back propagation method (meProp).</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The paper introduces a simple and effective technique for neural network learning. The main technique entails a modification to the vanilla backpropagation algorithm. The idea is that after a forward pass has been carried out in the usual fashion, we retain only a subset of the full gradient for computation of model parameters. 
More precisely, a simple quantization technique is employed to sparsify the gradient vectors, viz., the entries of the first gradient in a backpropagation step are set to zero unless they reach a specified size threshold. The rest of the gradients (the ones with respect to the weights and biases of the neural network) are computed using the chain rule in the typical way using the sparsified gradient obtained from the top layer. Since only a small subset of the weight matrix is modified, we obtain a linear reduction in the computational cost. The experimental results presented in the paper suggest that accuracy is improved rather than being degraded. The name given to the proposed technique is minimal effort back propagation method (meProp).</div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">[[File:20.png|right|650px]] </del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Fig. 1 shows an abstract view of the proposed approach.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Fig. 1 shows an abstract view of the proposed approach.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
</table>C3zhihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=meProp:_Sparsified_Back_Propagation_for_Accelerated_Deep_Learning_with_Reduced_Overfitting&diff=29588&oldid=prevC3zhi: /* Introduction */2017-11-07T15:58:32Z<p><span dir="auto"><span class="autocomment">Introduction</span></span></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 11:58, 7 November 2017</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l4">Line 4:</td>
<td colspan="2" class="diff-lineno">Line 4:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The main idea of the paper is to find only a small but critical subset of the gradient information and, in each learning step, update only this minimal subset of the parameters. This leads to sparsified gradients because only highly relevant parameters are updated and the rest remain untouched.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The main idea of the paper is to find only a small but critical subset of the gradient information and, in each learning step, update only this minimal subset of the parameters. This leads to sparsified gradients because only highly relevant parameters are updated and the rest remain untouched.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">The paper introduces a simple and effective technique for neural network learning. The main technique entails a modification to the vanilla backpropagation algorithm. The idea is that after a forward pass has been carried out in the usual fashion, we retain only a subset of the full gradient for computation of model parameters. More precisely, a simple quantization technique is employed to sparsify the gradient vectors, viz., the entries of the first gradient in a backpropagation step are set to zero unless they reach a specified size threshold. The rest of the gradients (the ones with respect to the weights and biases of the neural network) are computed using the chain rule in the typical way using the sparsified gradient obtained from the top layer. Since only a small subset of the weight matrix is modified, we obtain a linear reduction in the computational cost. The experimental results presented in the paper suggest that accuracy is improved rather than being degraded. The name given to the proposed technique is minimal effort back propagation method (meProp).</ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[File:20.png|right|650px]] </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[File:20.png|right|650px]] </div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">The paper introduces a simple and effective technique for neural network learning. The main technique entails a modification to the vanilla backpropagation algorithm. The idea is that after a forward pass has been carried out in the usual fashion, we retain only a subset of the full gradient for computation of model parameters. More precisely, a simple quantization technique is employed to sparsify the gradient vectors, viz., the entries of the first gradient in a backpropagation step are set to zero unless they reach a specified size threshold. The rest of the gradients (the ones with respect to the weights and biases of the neural network) are computed using the chain rule in the typical way using the sparsified gradient obtained from the top layer. Since only a small subset of the weight matrix is modified, we obtain a linear reduction in the computational cost. The experimental results presented in the paper suggest that accuracy is improved rather than being degraded. The name given to the proposed technique is minimal effort back propagation method (meProp).</del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Fig. 1 shows an abstract view of the proposed approach.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Fig. 1 shows an abstract view of the proposed approach.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
</table>C3zhihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=meProp:_Sparsified_Back_Propagation_for_Accelerated_Deep_Learning_with_Reduced_Overfitting&diff=29587&oldid=prevC3zhi: /* Introduction */2017-11-07T15:58:01Z<p><span dir="auto"><span class="autocomment">Introduction</span></span></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 11:58, 7 November 2017</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l4">Line 4:</td>
<td colspan="2" class="diff-lineno">Line 4:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The main idea of the paper is to find only a small but critical subset of the gradient information and, in each learning step, update only this minimal subset of the parameters. This leads to sparsified gradients because only highly relevant parameters are updated and the rest remain untouched.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The main idea of the paper is to find only a small but critical subset of the gradient information and, in each learning step, update only this minimal subset of the parameters. This leads to sparsified gradients because only highly relevant parameters are updated and the rest remain untouched.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>[[File:20.png|right|650px]] <del style="font-weight: bold; text-decoration: none;">Fig. 1 shows an abstract view of the proposed approach.</del></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>[[File:20.png|right|650px]] </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The paper introduces a simple and effective technique for neural network learning. The main technique entails a modification to the vanilla backpropagation algorithm. The idea is that after a forward pass has been carried out in the usual fashion, we retain only a subset of the full gradient for computation of model parameters. More precisely, a simple quantization technique is employed to sparsify the gradient vectors, viz., the entries of the first gradient in a backpropagation step are set to zero unless they reach a specified size threshold. The rest of the gradients (the ones with respect to the weights and biases of the neural network) are computed using the chain rule in the typical way using the sparsified gradient obtained from the top layer. Since only a small subset of the weight matrix is modified, we obtain a linear reduction in the computational cost. The experimental results presented in the paper suggest that accuracy is improved rather than being degraded. The name given to the proposed technique is minimal effort back propagation method (meProp).</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The paper introduces a simple and effective technique for neural network learning. The main technique entails a modification to the vanilla backpropagation algorithm. The idea is that after a forward pass has been carried out in the usual fashion, we retain only a subset of the full gradient for computation of model parameters. 
More precisely, a simple quantization technique is employed to sparsify the gradient vectors, viz., the entries of the first gradient in a backpropagation step are set to zero unless they reach a specified size threshold. The rest of the gradients (the ones with respect to the weights and biases of the neural network) are computed using the chain rule in the typical way using the sparsified gradient obtained from the top layer. Since only a small subset of the weight matrix is modified, we obtain a linear reduction in the computational cost. The experimental results presented in the paper suggest that accuracy is improved rather than being degraded. The name given to the proposed technique is minimal effort back propagation method (meProp).</div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> </del></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">Fig. 1 shows an abstract view of the proposed approach.</ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
</table>C3zhihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=meProp:_Sparsified_Back_Propagation_for_Accelerated_Deep_Learning_with_Reduced_Overfitting&diff=29586&oldid=prevC3zhi: /* Introduction */2017-11-07T15:57:26Z<p><span dir="auto"><span class="autocomment">Introduction</span></span></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 11:57, 7 November 2017</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l4">Line 4:</td>
<td colspan="2" class="diff-lineno">Line 4:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The main idea of the paper is to find only a small but critical subset of the gradient information and, in each learning step, update only this minimal subset of the parameters. This leads to sparsified gradients because only highly relevant parameters are updated and the rest remain untouched.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The main idea of the paper is to find only a small but critical subset of the gradient information and, in each learning step, update only this minimal subset of the parameters. This leads to sparsified gradients because only highly relevant parameters are updated and the rest remain untouched.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">[[File:20.png|right|650px]] Fig. 1 shows an abstract view of the proposed approach.</ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The paper introduces a simple and effective technique for neural network learning. The main technique entails a modification to the vanilla backpropagation algorithm. The idea is that after a forward pass has been carried out in the usual fashion, we retain only a subset of the full gradient for computation of model parameters. More precisely, a simple quantization technique is employed to sparsify the gradient vectors, viz., the entries of the first gradient in a backpropagation step are set to zero unless they reach a specified size threshold. The rest of the gradients (the ones with respect to the weights and biases of the neural network) are computed using the chain rule in the typical way using the sparsified gradient obtained from the top layer. Since only a small subset of the weight matrix is modified, we obtain a linear reduction in the computational cost. The experimental results presented in the paper suggest that accuracy is improved rather than being degraded. The name given to the proposed technique is minimal effort back propagation method (meProp).</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The paper introduces a simple and effective technique for neural network learning. The main technique entails a modification to the vanilla backpropagation algorithm. The idea is that after a forward pass has been carried out in the usual fashion, we retain only a subset of the full gradient for computation of model parameters. 
More precisely, a simple quantization technique is employed to sparsify the gradient vectors, viz., the entries of the first gradient in a backpropagation step are set to zero unless they reach a specified size threshold. The rest of the gradients (the ones with respect to the weights and biases of the neural network) are computed using the chain rule in the typical way using the sparsified gradient obtained from the top layer. Since only a small subset of the weight matrix is modified, we obtain a linear reduction in the computational cost. The experimental results presented in the paper suggest that accuracy is improved rather than being degraded. The name given to the proposed technique is minimal effort back propagation method (meProp).</div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">[[File:20.png|right|650px]] Fig. 1 shows an abstract view of the proposed approach. </del></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"> </ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
</table>C3zhi