http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=As2na&feedformat=atomstatwiki - User contributions [US]2022-01-24T00:22:35ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18&diff=38602stat441F182018-11-10T03:49:21Z<p>As2na: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F18-STAT841-Proposal| Project Proposal ]] ==<br />
<br />
[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Nov 13 || Jason Schneider, Jordyn Walton, Zahraa Abbas, Andrew Na || 1|| Memory-Based Parameter Adaptation || [https://arxiv.org/pdf/1802.10542.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory-Based_Parameter_Adaptation#Incremental_Learning Summary] ||<br />
|-<br />
|Nov 13 ||Sai Praneeth M, Xudong Peng, Alice Li, Shahrzad Hosseini Vajargah|| 2|| Going Deeper with Convolutions ||[https://arxiv.org/pdf/1409.4842.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary]<br />
|-<br />
|Nov 15 || Yan Yu Chen, Qisi Deng, Hengxin Li, Bochao Zhang|| 3|| Topic Compositional Neural Language Model|| [https://arxiv.org/pdf/1712.09783.pdf Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM Summary]<br />
|-<br />
|Nov 15 || Zhaoran Hou, Pei Wei Wang, Chi Zhang, Yiming Li, Daoyi Chen, Ying Chi|| 4|| Extreme Learning Machine for regression and Multi-class Classification|| [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6035797 Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841F18/ Summary]<br />
|-<br />
|Nov 20 || Kristi Brewster, Isaac McLellan, Ahmad Nayar Hassan, Marina Medhat Rassmi Melek, Brendan Ross, Jon Barenboim, Junqiao Lin, James Bootsma || 5|| A Neural Representation of Sketch Drawings || || <br />
|-<br />
|Nov 20 || Maya (Mahdiyeh) Bayati, Saber Malekmohammadi, Vincent Loung || 6|| Convolutional Neural Networks for Sentence Classification || [https://arxiv.org/pdf/1408.5882.pdf Paper] || <br />
|-<br />
|Nov 22 || Qingxi Huo, Yanmin Yang, Jiaqi Wang, Yuanjing Cai, Colin Stranc, Philomène Bobichon, Aditya Maheshwari, Zepeng An || 7|| Robust Probabilistic Modeling with Bayesian Data Reweighting || [http://proceedings.mlr.press/v70/wang17g/wang17g.pdf Paper] || <br />
|-<br />
|Nov 22 || Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su, Jiacheng Weng, Keqi Li, Yi Qian, Bomeng Liu || 8|| Deep Residual Learning for Image Recognition || [http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf Paper] || <br />
|-<br />
|Nov 27 || Mitchell Snaith || 9|| You Only Look Once: Unified, Real-Time Object Detection, V1 -> V3 || [https://arxiv.org/pdf/1506.02640.pdf Paper] || <br />
|-<br />
|Nov 27 || Qi Chu, Gloria Huang, Dylan Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu, Shikun Cui || 10|| tba || || <br />
|-<br />
|Nov 29 || Jameson Ngo, Amy Xu, Aden Grant, Yu Hao Wang, Andrew McMurry, Baizhi Song, Yongqi Dong || 11|| Towards Deep Learning Models Resistant to Adversarial Attacks || [https://arxiv.org/pdf/1706.06083.pdf Paper] || <br />
|-<br />
|Nov 29 || Qianying Zhao, Hui Huang, Lingyun Yi, Jiayue Zhang, Siao Chen, Rongrong Su, Gezhou Zhang, Meiyu Zhou || 12|| || ||<br />
|-<br />
|Makeup || Hudson Ash, Stephen Kingston, Richard Zhang, Alexandre Xiao, Ziqiu Zhu || || || ||<br />
|-<br />
|Makeup || Frank Jiang, Yuan Zhang, Jerry Hu || || || ||<br />
|-<br />
|Makeup || Yu Xuan Lee, Tsen Yee Heng || 15 || Gradient Episodic Memory for Continual Learning || [http://papers.nips.cc/paper/7225-gradient-episodic-memory-for-continual-learning.pdf Paper] ||<br />
|-<br />
|Makeup || || || || ||</div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory-Based_Parameter_Adaptation&diff=38601Memory-Based Parameter Adaptation2018-11-10T03:48:14Z<p>As2na: /* Incremental Learning */</p>
<hr />
<div>This is a summary based on the paper, Memory-based Parameter Adaptation by Sprechmann et al.<sup>[[#References|[1]]]</sup>.<br />
<br />
The paper generalizes some approaches in language modelling that seek to overcome some of the shortcomings of neural networks including the phenomenon of catastrophic forgetting using memory-based adaptation. Catastrophic forgetting occurs when neural networks perform poorly on old tasks after they have been trained to perform well on a new task. The paper also presents experimental results where the model in question is applied to continual and incremental learning tasks.<br />
<br />
= Presented by = <br />
*J.Walton<br />
*J.Schneider<br />
*Z.Abbas<br />
*A.Na<br />
<br />
= Introduction = <br />
<br />
Memory-based parameter adaptation (MbPA) is based on the theory of complementary learning systems, which states that intelligent agents must possess two learning systems: one that allows the gradual acquisition of knowledge and another that allows rapid learning of the specifics of individual experiences<sup>[[#References|[2]]]</sup>. Similarly, MbPA consists of two components: a parametric component and a non-parametric component. The parametric component is a standard neural network, which learns slowly (low learning rates) but generalizes well. The non-parametric component, on the other hand, is an episodic memory attached to the network that stores previous experiences and is used to locally adapt the weights of the parametric component. The two components therefore serve different purposes during the training and testing phases.<br />
<br />
= Model Architecture = <br />
[[File:MbPA_model_architecture.PNG|700px|thumb|center|Architecture for the MbPA model. Left: Training Usage. Right: Testing Setting.]]<br />
<br />
== Training Phase == <br />
<br />
The model consists of three components: an embedding network <math>f_{\gamma}</math>, a memory <math>M</math> and an output network <math>g_{\theta}</math>. The embedding network and the output network can be thought of as the standard feedforward neural networks for our purposes, with parameters (weights) <math>\gamma</math> and <math>\theta</math>, respectively. The memory, denoted by <math>M</math>, stores “experiences” in the form of key and value pairs <math>\{(h_{i},v_{i})\}</math> where the keys <math>h_{i}</math> are the outputs of the embedding network <math>f_{\gamma}(x_{i})</math> and the values <math>v_{i}</math>, in the context of classification, are simply the true class labels <math>y_{i}</math>. Thus, for a given input <math>x_{j}</math><br />
<br />
<center><br />
<math><br />
f_{\gamma}(x_{j}) \rightarrow h_{j},<br />
</math><br />
</center><br />
<br />
<center><br />
<math><br />
y_{j} \rightarrow v_{j}.<br />
</math><br />
</center> <br />
<br />
Note that the memory has a fixed size; thus when it is full, the oldest data is discarded first.<br />
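A fixed-size memory with oldest-first eviction can be sketched with a bounded double-ended queue. This is only an illustrative sketch: the class name <code>EpisodicMemory</code>, its method names and the toy keys are hypothetical and are not taken from the paper.

```python
from collections import deque


class EpisodicMemory:
    """Hypothetical fixed-size key/value memory with first-in, first-out eviction."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest entry automatically once it is full
        self.items = deque(maxlen=capacity)

    def write(self, key, value):
        # key plays the role of the embedding h = f_gamma(x), value the label y
        self.items.append((tuple(key), value))

    def __len__(self):
        return len(self.items)


# Toy usage: write five entries into a memory that can hold only three
memory = EpisodicMemory(capacity=3)
for i in range(5):
    memory.write([float(i), float(i)], i % 2)

stored_keys = [k for k, _ in memory.items]
stored_values = [v for _, v in memory.items]
```

Only the three most recent entries survive; the first two writes have been discarded.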
<br />
During training, the authors randomly sample a set of <math>b</math> training examples (i.e. a mini-batch of size <math>b</math>), say <math>\{(x_{b},y_{b})\}_{b}</math>, from the training data and feed them into the embedding network <math>f_{\gamma}</math>, followed by the output network <math>g_{\theta}</math>. The parameters of the embedding and output networks are updated by maximizing the likelihood of the target values (equivalently, minimizing the negative log-likelihood loss)<br />
<br />
<center><br />
<math><br />
p(y|x,\gamma,\theta)=g_{\theta}(f_{\gamma}(x)).<br />
</math><br />
</center><br />
<br />
The last layer of the output network <math>g_{\theta}</math> is a softmax layer, such that the output can be interpreted as a probability distribution. This process is also known as backpropagation with mini-batch gradient descent. Finally, the embedded samples <math>\{(f_{\gamma}(x_{b}),y_{b})\}_{b}</math> are stored into the memory. No local adaptation takes place during this phase.<br />
<br />
== Testing Phase ==<br />
During the testing phase, the model will temporarily adapt the weights of the output network <math>g_{\theta}</math> based on the input <math>x</math> and the contents of the memory, <math>M</math>, according to<br />
<center><br />
<math><br />
\theta^x = \theta + \Delta_M.<br />
</math><br />
</center><br />
First, <math>x</math> is fed into the embedding network to obtain the query <math>q = f_{\gamma}(x)</math>. Based on the query <math>q</math>, a K-nearest neighbours search is conducted over the keys in memory. The context, <math>C</math>, is the result of this search.<br />
<center><br />
<math><br />
C = \{(h_k, v_k, w_k^{(x)})\}^K_{k=1}<br />
</math><br />
</center><br />
Each of the neighbours has a weighting <math>w_k^{(x)}</math> attached to it, based on how close it is to query <math>q</math>. This calculation is based on the kernel function,<br />
<center><br />
<math><br />
kern(h,q) = \frac{1}{\epsilon + ||h-q||^2_2}.<br />
</math><br />
</center><br />
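The retrieval and weighting steps can be sketched as follows. This is a minimal sketch under assumptions: the function names, the brute-force search, and the normalisation of the weights to sum to one are illustrative choices, and the toy memory contents are made up.

```python
def squared_distance(h, q):
    # Squared Euclidean distance ||h - q||^2
    return sum((a - b) ** 2 for a, b in zip(h, q))


def kern(h, q, eps=1e-3):
    # Inverse-distance kernel 1 / (eps + ||h - q||^2)
    return 1.0 / (eps + squared_distance(h, q))


def retrieve_context(memory, q, k, eps=1e-3):
    """Return the k nearest (key, value, weight) triples for query q.

    memory is a list of (key, value) pairs. Weights are proportional to
    kern(key, q) and are normalised to sum to one (an assumption made here).
    """
    nearest = sorted(memory, key=lambda kv: squared_distance(kv[0], q))[:k]
    raw = [kern(h, q, eps) for h, _ in nearest]
    total = sum(raw)
    return [(h, v, w / total) for (h, v), w in zip(nearest, raw)]


# Toy usage with three stored keys and K = 2
toy_memory = [((0.0, 0.0), "a"), ((1.0, 0.0), "b"), ((3.0, 0.0), "c")]
context = retrieve_context(toy_memory, q=(0.0, 0.0), k=2, eps=1.0)
```

With <code>eps=1.0</code> the two raw kernel values are 1 and 0.5, so the closer neighbour receives weight 2/3 and the farther one 1/3.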
The temporary updates during adaptation are obtained by maximizing the weighted log-likelihood over the neighbours in <math>C</math> together with a prior term on the adapted weights, i.e. the maximum a posteriori (MAP) estimate over the context <math>C</math>,<br />
<center><br />
<math><br />
\max_{\theta^x} \log p(\theta^x | \theta) + \sum^K_{k=1}w_k^{(x)} \log p(v^{(x)}_k | h_k^{(x)}, \theta^x,x). <br />
</math><br />
</center><br />
Note that the first term here acts as a regularizer that prevents over-fitting by keeping <math>\theta^x</math> close to <math>\theta</math>. Unfortunately, the objective above does not have a closed-form solution. However, it can be maximized approximately using a fixed number of gradient steps. Each of these steps is calculated via <math>\Delta_M</math>,<br />
<center><br />
<math><br />
\Delta_M (x, \theta) = - \alpha_M \nabla_\theta \sum^K_{k=1} w_k^{(x)} \log p(v^{(x)}_k | h_k^{(x)}, \theta^x,x)\bigg |_\theta - \beta(\theta - \theta^x), <br />
</math><br />
</center><br />
where <math>\alpha_M</math> is the step size of the local adaptation and <math>\beta</math> is a scalar hyper-parameter that weights the regularization term. After a series of gradient steps, the weights of the output network <math>g_{\theta}</math> are temporarily adapted and a prediction <math>\hat y</math> is made.<br />
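To make the local adaptation concrete, the sketch below runs a few gradient steps for a toy one-dimensional regression. Everything specific here is an assumption for illustration: the output network is a hypothetical linear model <code>g(h) = bias + slope * h</code>, the likelihood is Gaussian (so the log-likelihood becomes a weighted squared error), the prior is Gaussian around the original weights, and the code minimizes the negative of the MAP objective. It is not the authors' implementation.

```python
def adapt_locally(theta, context, alpha=0.1, beta=0.01, steps=20):
    """Return temporarily adapted weights theta_x; theta itself is left unchanged.

    theta   : (bias, slope) of the hypothetical linear output network
    context : list of (h, v, w) triples with scalar key h, value v, weight w
    alpha   : step size of the local adaptation
    beta    : strength of the pull back towards the original weights
    """
    bias, slope = theta
    for _ in range(steps):
        # Gradient of the regularizer (beta / 2) * ||theta_x - theta||^2
        grad_bias = beta * (bias - theta[0])
        grad_slope = beta * (slope - theta[1])
        # Gradient of the weighted squared error over the retrieved neighbours
        for h, v, w in context:
            err = (bias + slope * h) - v
            grad_bias += w * err
            grad_slope += w * err * h
        bias -= alpha * grad_bias
        slope -= alpha * grad_slope
    return (bias, slope)


# Toy usage: two neighbours that lie on the line v = 2h, queried at h = 1.1
theta = (0.0, 0.0)
context = [(1.0, 2.0, 0.5), (1.2, 2.4, 0.5)]
theta_x = adapt_locally(theta, context)

query = 1.1
before = theta[0] + theta[1] * query
after = theta_x[0] + theta_x[1] * query
```

The adapted weights are used only for this query; the stored weights <code>theta</code> are untouched, which mirrors the temporary nature of the update.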
<br />
[[File:Figure2.PNG|400px|thumb|center|Local fitting on a regression task given a query (blue) and the context from memory (red).<sup>[[#References|[1]]]</sup>.]]<br />
<br />
As can be seen in figure 2, the final prediction <math>\hat y</math> is similar to a weighted average of the values of the K-nearest neighbours.<br />
<br />
= Examples =<br />
<br />
== Continual Learning ==<br />
Continual learning is the process of learning multiple tasks in a sequence without revisiting a task. The authors consider a permuted MNIST setup, similar to [[#References|[3]]], where each task was given by a different permutation of the pixels. The authors sequentially trained the MbPA on 20 different permutations and tested on previously trained tasks.<br />
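The task construction can be sketched as follows: each task applies one fixed random permutation to the pixels of every image. This is a minimal sketch; the function names, the use of seeds to identify tasks, and the stand-in image are assumptions, and no real MNIST data is loaded.

```python
import random


def make_permutation(num_pixels, seed):
    # One fixed pixel permutation per task, reproducible from its seed
    rng = random.Random(seed)
    perm = list(range(num_pixels))
    rng.shuffle(perm)
    return perm


def permute_image(pixels, perm):
    # Reorder a flattened image according to the task's permutation
    return [pixels[i] for i in perm]


num_pixels = 28 * 28  # size of a flattened MNIST image
task_perms = [make_permutation(num_pixels, seed=t) for t in range(20)]

image = list(range(num_pixels))  # stand-in for a real flattened image
permuted = permute_image(image, task_perms[0])
```

Every image within a task is reordered the same way, so the label structure is preserved while the input layout differs from task to task.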
<br />
The model was trained on 10 000 examples per task, using a 2 layer multi-layer perceptron (MLP) with an ADAM optimizer. The elastic weight consolidation (EWC) method and regular gradient descent were used to estimate the parameters. A grid search was used to determine the EWC penalty cost and the local MbPA learning rate was set as <math>\beta\in(0.0,0.1)</math> and number of steps (n) was <math>n\in[1,20]</math>.<br />
<br />
[[File:ContinualLearning.PNG|400px|thumb|center|Results on baseline comparisons on permuted MNIST<br />
with MbPA using different memory sizes.]]<br />
<br />
The authors used the pixels as the embedding, i.e. <math>f_{\gamma}</math> is the identity function, and looked at regimes where the episodic memory was small. They found that, with MbPA, only a few gradient steps on carefully selected data from memory are enough to recover performance. MbPA outperformed the MLP and worked better than EWC in most cases, and its performance grew with the number of examples stored. They note that the memory requirements were lower than those of EWC. The lower memory requirements are attributed to the fact that EWC stores all task identifiers, whereas MbPA only stores a few examples. The figure above also shows the results of MbPA combined with other methods. It is noted that MbPA combined with EWC gives the best results.<br />
<br />
== Incremental Learning ==<br />
<br />
Incremental learning has two steps. First, the model is trained on a subset of the classes found in the training data. The second step is to give it the entire training set and see how long it takes for the model to perform well on the entire set. The purpose of this is to see how quickly the model learns information about new classes and how likely it is to lose information about the old ones. The authors used the ImageNet dataset from [[#References|[4]]], and the initial training set contained 500 out of the 1000 classes.<br />
<br />
For the first step, they used three models: a parametric model, MbPA, and a mixture model. The parametric model was ResNet V1 from [[#References|[5]]]. It was used both as the parametric model in MbPA and as a separate model for testing. The non-parametric model was the memory as described earlier. The memory was created by taking the keys from the second-to-last layer of the parametric model. The mixture model was a convex combination of the outputs of the parametric and non-parametric models, as shown below:<br />
<br />
<center><br />
<math><br />
p(y|q) = \lambda p_{param}(y|q) + (1-\lambda)p_{mem}(y|q).<br />
</math><br />
</center><br />
<br />
<math>\lambda</math> was tuned as a hyperparameter. Finally, MbPA was used as the third model, with the ResNet V1 parametric model and a non-parametric component identical to the one described above. The models were evaluated using their “Top 1” accuracy; that is, the class with the highest output value was taken to be the model’s prediction for a given data point in the test set.<br />
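The mixture and the Top 1 rule can be sketched in a few lines. This is a hedged illustration: the function names and the probability vectors are made up, and how the memory-based distribution is obtained is not shown here.

```python
def mixture(p_param, p_mem, lam):
    # Convex combination: lam * p_param + (1 - lam) * p_mem, class by class
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(p_param, p_mem)]


def top1(probs):
    # Index of the class with the highest output value
    return max(range(len(probs)), key=probs.__getitem__)


# Made-up class distributions over three classes
p_param = [0.7, 0.2, 0.1]
p_mem = [0.1, 0.8, 0.1]
p_mix = mixture(p_param, p_mem, lam=0.25)
prediction = top1(p_mix)
```

With a small <code>lam</code> the memory dominates, so the mixture can predict a different class from the parametric model on its own.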
<br />
[[File:Figure4.PNG|400px|thumb|center|All three models perform similarly on the data they were pre-trained on. On the new classes, the mixture and parametric models perform similarly and MbPA performs much better<sup>[[#References|[1]]]</sup>.]]<br />
<br />
There was also a test on how well the models perform on unbalanced datasets. In addition to the previous three, they included a non-parametric model which was just the memory running without the rest of the network. Since most real-world datasets have different amounts of data in each class, a model that could use unbalanced datasets without becoming biased would have more information available to it for training. The testing here was done similarly to the other incremental learning experiment. The models were trained on 500 of the 1000 classes until they performed well. They were then given a dataset containing all of the data from the first 500 classes and only 10% of the data from the other 500 classes. Accuracy was evaluated both using Top 1 and AUC (area under the curve) accuracy. It was found that after 0.1 epochs, MbPA and the non-parametric model performed similarly and much better than the other two by both accuracy metrics. After 1 or 3 epochs, the non-parametric model begins to perform worse than the others and MbPA continues to perform better.<br />
<br />
= Conclusion =<br />
<br />
The MbPA model can successfully overcome several shortcomings associated with neural networks through its non-parametric, episodic memory. In fact, many other works in the context of classification and language modelling have successfully used variants of this architecture, where traditional neural network systems are augmented with memories. Likewise, the experiments in incremental and continual learning presented in this paper use a memory architecture similar to the Differentiable Neural Dictionary (DND) used in Neural Episodic Control (NEC) found in [[#References|[6]]], though the gradients from the memory in the MbPA model are not used during training. In conclusion, MbPA presents a natural way to improve the performance of standard deep networks.<br />
<br />
=References=<br />
* <sup>[1]</sup>Pablo Sprechmann, Siddhant Jayakumar, Jack Rae, Alexander Pritzel, Adria Badia, Benigno Uria, Oriol Vinyals, Demis Hassabis, Razvan Pascanu, and Charles Blundell. Memory-based parameter adaptation. ICLR, 2018.<br />
<br />
* <sup>[2]</sup>Dhushan Kumaran, Demis Hassabis, and James McClelland. What learning systems do intelligent agents need? Trends in Cognitive Sciences, 2016.<br />
<br />
* <sup>[3]</sup>Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint, 2013.<br />
<br />
* <sup>[4]</sup>Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, and Michael Bernstein. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.<br />
<br />
* <sup>[5]</sup>Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2016.<br />
<br />
* <sup>[6]</sup>Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. ICML, 2017.</div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory-Based_Parameter_Adaptation&diff=38600Memory-Based Parameter Adaptation2018-11-10T03:48:03Z<p>As2na: /* Continual Learning */</p>
<hr />
<div>This is a summary based on the paper, Memory-based Parameter Adaptation by Sprechmann et al.<sup>[[#References|[1]]]</sup>.<br />
<br />
The paper generalizes some approaches in language modelling that seek to overcome some of the shortcomings of neural networks including the phenomenon of catastrophic forgetting using memory-based adaptation. Catastrophic forgetting occurs when neural networks perform poorly on old tasks after they have been trained to perform well on a new task. The paper also presents experimental results where the model in question is applied to continual and incremental learning tasks.<br />
<br />
= Presented by = <br />
*J.Walton<br />
*J.Schneider<br />
*Z.Abbas<br />
*A.Na<br />
<br />
= Introduction = <br />
<br />
Memory-based parameter adaptation (MbPA) is based on the theory of complementary learning systems, which states that intelligent agents must possess two learning systems: one that allows the gradual acquisition of knowledge and another that allows rapid learning of the specifics of individual experiences<sup>[[#References|[2]]]</sup>. Similarly, MbPA consists of two components: a parametric component and a non-parametric component. The parametric component is a standard neural network that learns slowly (low learning rates) but generalizes well. The non-parametric component, on the other hand, is an episodic memory that stores previous experiences and is used to locally adapt the weights of the parametric component. The parametric and non-parametric components therefore serve different purposes during the training and testing phases.<br />
<br />
= Model Architecture = <br />
[[File:MbPA_model_architecture.PNG|700px|thumb|center|Architecture for the MbPA model. Left: Training Usage. Right: Testing Setting.]]<br />
<br />
== Training Phase == <br />
<br />
The model consists of three components: an embedding network <math>f_{\gamma}</math>, a memory <math>M</math> and an output network <math>g_{\theta}</math>. The embedding network and the output network can be thought of as the standard feedforward neural networks for our purposes, with parameters (weights) <math>\gamma</math> and <math>\theta</math>, respectively. The memory, denoted by <math>M</math>, stores “experiences” in the form of key and value pairs <math>\{(h_{i},v_{i})\}</math> where the keys <math>h_{i}</math> are the outputs of the embedding network <math>f_{\gamma}(x_{i})</math> and the values <math>v_{i}</math>, in the context of classification, are simply the true class labels <math>y_{i}</math>. Thus, for a given input <math>x_{j}</math><br />
<br />
<center><br />
<math><br />
f_{\gamma}(x_{j}) \rightarrow h_{j},<br />
</math><br />
</center><br />
<br />
<center><br />
<math><br />
y_{j} \rightarrow v_{j}.<br />
</math><br />
</center> <br />
<br />
Note that the memory has a fixed size; thus when it is full, the oldest data is discarded first.<br />
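The fixed-size memory described above can be sketched as a first-in, first-out buffer of key and value pairs. This is a minimal illustration only: the class name <code>EpisodicMemory</code> and its methods are hypothetical and are not taken from the paper.

```python
from collections import deque


class EpisodicMemory:
    """Hypothetical fixed-size store of (key, value) pairs."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self.buffer = deque(maxlen=capacity)

    def write(self, key, value):
        # key plays the role of h = f_gamma(x); value plays the role of v = y.
        self.buffer.append((tuple(key), value))

    def items(self):
        return list(self.buffer)

    def __len__(self):
        return len(self.buffer)


memory = EpisodicMemory(capacity=2)
memory.write([0.0, 1.0], 3)
memory.write([1.0, 1.0], 7)
memory.write([2.0, 0.5], 1)  # memory is full, so the oldest pair is discarded
```

After the third write, only the two most recent pairs remain in the buffer.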
<br />
During training, the authors randomly sample a set of <math>b</math> training examples (i.e. a mini-batch of size <math>b</math>), say <math>\{(x_{b},y_{b})\}_{b}</math>, from the training data and pass them through the embedding network <math>f_{\gamma}</math>, followed by the output network <math>g_{\theta}</math>. The parameters of the embedding and output networks are updated by maximizing the likelihood function (equivalently, minimizing the negative log-likelihood loss) of the target values<br />
<br />
<center><br />
<math><br />
p(y|x,\gamma,\theta)=g_{\theta}(f_{\gamma}(x)).<br />
</math><br />
</center><br />
<br />
The last layer of the output network <math>g_{\theta}</math> is a softmax layer, such that the output can be interpreted as a probability distribution. This process is also known as backpropagation with mini-batch gradient descent. Finally, the embedded samples <math>\{(f_{\gamma}(x_{b}),y_{b})\}_{b}</math> are stored into the memory. No local adaptation takes place during this phase.<br />
<br />
== Testing Phase ==<br />
During the testing phase, the model will temporarily adapt the weights of the output network <math>g_{\theta}</math> based on the input <math>x</math> and the contents of the memory, <math>M</math>, according to<br />
<center><br />
<math><br />
\theta^x = \theta + \Delta_M.<br />
</math><br />
</center><br />
First, <math>x</math> is passed through the embedding network to obtain the query <math>q = f_{\gamma}(x)</math>. Based on the query <math>q</math>, a K-nearest neighbours search is conducted over the keys in memory. The context, <math>C</math>, is the result of this search.<br />
<center><br />
<math><br />
C = \{(h_k, v_k, w_k^{(x)})\}^K_{k=1}<br />
</math><br />
</center><br />
Each of the neighbours has a weighting <math>w_k^{(x)}</math> attached to it, based on how close it is to query <math>q</math>. This calculation is based on the kernel function,<br />
<center><br />
<math><br />
kern(h,q) = \frac{1}{\epsilon + ||h-q||^2_2}.<br />
</math><br />
</center><br />
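The retrieval step and the kernel weighting can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the search is a brute-force scan rather than whatever index the authors used, and normalizing the weights to sum to one is an assumption made here for convenience.

```python
def squared_distance(h, q):
    return sum((a - b) ** 2 for a, b in zip(h, q))


def kern(h, q, eps=1e-3):
    # Inverse squared-distance kernel: 1 / (eps + ||h - q||^2).
    return 1.0 / (eps + squared_distance(h, q))


def retrieve_context(memory_items, q, k, eps=1e-3):
    """Return the k nearest (key, value, weight) triples for query q."""
    # Brute-force K-nearest-neighbour search over all stored keys.
    nearest = sorted(memory_items, key=lambda kv: squared_distance(kv[0], q))[:k]
    raw = [kern(h, q, eps) for h, _ in nearest]
    total = sum(raw)
    # Assumption: weights are normalized so that they sum to one.
    return [(h, v, w / total) for (h, v), w in zip(nearest, raw)]


items = [((0.0, 0.0), 0), ((1.0, 0.0), 1), ((5.0, 5.0), 2)]
context = retrieve_context(items, q=(0.1, 0.0), k=2)
```

Here the two keys closest to the query are returned, and the closer key receives the larger weight.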
The temporary updates during adaptation are obtained by maximizing the weighted sum of the log-likelihoods over the neighbours in <math>C</math> together with a log-prior term on the adapted weights, i.e. a maximum a posteriori estimate over the context <math>C</math>,<br />
<center><br />
<math><br />
\max_{\theta^x} \log p(\theta^x | \theta) + \sum^K_{k=1}w_k^{(x)} \log p(v^{(x)}_k | h_k^{(x)}, \theta^x,x). <br />
</math><br />
</center><br />
Note that the first term acts as a regularizer that prevents over-fitting. Unfortunately, the maximization above does not have a closed-form solution. However, it can be approximately solved using a fixed number of gradient steps. Each of these steps is calculated via <math>\Delta_M</math>,<br />
<center><br />
<math><br />
\Delta_M (x, \theta) = - \alpha_M \nabla_\theta \sum^K_{k=1} w_k^{(x)} \log p(v^{(x)}_k | h_k^{(x)}, \theta^x,x)\bigg |_\theta - \beta(\theta - \theta^x), <br />
</math><br />
</center><br />
where <math>\beta</math> is a hyper-parameter of gradient descent. After a series of gradient descent steps, the weights of the final output network <math>g_{\theta}</math> are temporarily adapted and a prediction is made, <math>\hat y</math>.<br />
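A minimal sketch of the local adaptation loop is given below. It makes several assumptions that are not in the summary: the output model is linear, the likelihood is Gaussian (so the log-likelihood is a negative squared error, as in a regression task), and the prior term is a quadratic penalty that pulls the adapted weights back towards <math>\theta</math>. The sketch performs gradient ascent on the maximum a posteriori objective; the function name <code>adapt</code> and its default values are hypothetical.

```python
def adapt(theta, context, alpha=0.1, beta=0.5, steps=20):
    """Return temporarily adapted weights theta_x; theta itself is not modified."""
    theta_x = list(theta)
    for _ in range(steps):
        # Gradient of the assumed prior term: -(beta / 2) * ||theta_x - theta||^2.
        grad = [-beta * (tx - t) for tx, t in zip(theta_x, theta)]
        for h, v, w in context:
            # Assumed linear model with Gaussian likelihood.
            prediction = sum(tx * x for tx, x in zip(theta_x, h))
            residual = v - prediction
            for i, x in enumerate(h):
                grad[i] += w * residual * x
        # Ascent step on the weighted log-likelihood plus log-prior.
        theta_x = [tx + alpha * g for tx, g in zip(theta_x, grad)]
    return theta_x


theta = [0.0]
context = [((1.0,), 2.0, 1.0)]  # one neighbour: key, value, weight
adapted = adapt(theta, context, steps=200)
```

With a single neighbour, the adapted weight settles between the original weight and the value that would fit the neighbour exactly, which shows the regularizing effect of the prior term.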
<br />
[[File:Figure2.PNG|400px|thumb|center|Local fitting on a regression task given a query (blue) and the context from memory (red).<sup>[[#References|[1]]]</sup>.]]<br />
<br />
As can be seen in figure 2, the final prediction <math>\hat y</math> is similar to a weighted average of the values of the K-nearest neighbours.<br />
<br />
= Examples =<br />
<br />
== Continual Learning ==<br />
Continual learning is the process of learning multiple tasks in a sequence without revisiting a task. The authors consider a permuted MNIST setup, similar to [[#References|[3]]], where each task was given by a different permutation of the pixels. The authors sequentially trained the MbPA on 20 different permutations and tested on previously trained tasks.<br />
<br />
The model was trained on 10 000 examples per task, using a two-layer multi-layer perceptron (MLP) with the Adam optimizer. The elastic weight consolidation (EWC) method and regular gradient descent were used to estimate the parameters. A grid search was used to determine the EWC penalty cost; the local MbPA learning rate was set as <math>\beta\in(0.0,0.1)</math> and the number of steps as <math>n\in[1,20]</math>.<br />
<br />
[[File:ContinualLearning.PNG|700px|thumb|center|Results on baseline comparisons on permuted MNIST<br />
with MbPA using different memory sizes.]]<br />
<br />
The authors used the pixels as the embedding, i.e. <math>f_{\gamma}</math> is the identity function, and looked at regimes where the episodic memory was small. They found that, with MbPA, only a few gradient steps on carefully selected data from memory are enough to recover performance. MbPA outperformed the MLP and worked better than EWC in most cases, and its performance grew with the number of examples stored. They note that the memory requirements were lower than those of EWC. The lower memory requirements are attributed to the fact that EWC stores all task identifiers, whereas MbPA only stores a few examples. The figure above also shows the results of MbPA combined with other methods. MbPA combined with EWC gives the best results.<br />
<br />
== Incremental Learning ==<br />
<br />
Incremental learning has two steps. First, the model is trained on a subset of the classes found in the training data. The second step is to give it the entire training set and see how long it takes for the model to perform well on the entire set. The purpose of this is to see how quickly the model learns information about new classes and how likely it is to lose information about the old ones. The authors used the ImageNet dataset from [[#References|[4]]], and the initial training set contained 500 out of the 1000 classes.<br />
<br />
For the first step, they used three models: a parametric model, MbPA, and a mixture model. The parametric model was ResNet V1 from [[#References|[5]]]. It was used both as the parametric component of MbPA and as a separate model for testing. The non-parametric model was the memory as described earlier. The keys in the memory were taken from the second-to-last layer of the parametric model. The mixture model was a convex combination of the outputs of the parametric and non-parametric models, as shown below:<br />
<br />
<center><br />
<math><br />
p(y|q) = \lambda p_{param}(y|q) + (1-\lambda)p_{mem}(y|q).<br />
</math><br />
</center><br />
<br />
<math>\lambda</math> was tuned as a hyperparameter. Finally, MbPA used the ResNet V1 parametric model together with the same memory as the non-parametric model described above. The models were evaluated using their “Top 1” accuracy; that is, the class with the highest output value was taken to be the model’s prediction for a given data point in the test set.<br />
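The mixture model and the Top 1 rule can be illustrated with a short sketch. The function names and the example probabilities are hypothetical and chosen only for illustration; how the memory-based distribution is computed is not covered here.

```python
def mixture(p_param, p_mem, lam):
    """Convex combination of two class-probability vectors."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(p_param, p_mem)]


def top1(probabilities):
    # Index of the class with the highest output value.
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


p_param = [0.7, 0.3]  # illustrative output of the parametric model
p_mem = [0.1, 0.9]    # illustrative output of the memory-based model
p_mix = mixture(p_param, p_mem, lam=0.5)
prediction = top1(p_mix)
```

Because the combination is convex, the mixed vector is still a probability distribution whenever both inputs are.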
<br />
[[File:Figure4.PNG|700px|thumb|center|All three models perform similarly on the data they were pre-trained on. On the new classes, the mixture and parametric models perform similarly and MbPA performs much better<sup>[[#References|[1]]]</sup>.]]<br />
<br />
There was also a test of how well the models perform on unbalanced datasets. In addition to the previous three, they included a non-parametric model which was just the memory running without the rest of the network. Since most real-world datasets have different amounts of data in each class, a model that could use unbalanced datasets without becoming biased would have more information available to it for training. The testing here was done similarly to the other incremental learning experiment. The models were trained on 500 of the 1000 classes until they performed well. They were then given a dataset containing all of the data from the first 500 classes and only 10% of the data from the other 500 classes. Performance was evaluated using both Top 1 accuracy and AUC (area under the curve). It was found that after 0.1 epochs, MbPA and the non-parametric model performed similarly and much better than the other two by both metrics. After 1 or 3 epochs, the non-parametric model began to perform worse than the others and MbPA continued to perform better.<br />
<br />
= Conclusion =<br />
<br />
The MbPA model can successfully overcome several shortcomings associated with neural networks through its non-parametric, episodic memory. In fact, many other works in the context of classification and language modelling have successfully used variants of this architecture, where traditional neural network systems are augmented with memories. Likewise, the experiments in incremental and continual learning presented in this paper use a memory architecture similar to the Differential Neural Dictionary (DND) used in Neural Episodic Control (NEC) found in [[#References|[6]]], though the gradients from the memory in the MbPA model are not used during training. In conclusion, MbPA presents a natural way to improve the performance of standard deep networks.<br />
<br />
=References=<br />
* <sup>[1]</sup>Sprechmann. Pablo, Jayakumar. Siddhant, Rae. Jack, Pritzel. Alexander, Badia. Adria, Uria. Benigno, Vinyals. Oriol, Hassabis. Demis, Pascanu. Razvan, and Blundell. Charles. Memory-based parameter adaptation. ICLR, 2018.<br />
<br />
* <sup>[2]</sup>Kumaran. Dhushan, Hassabis. Demis, and McClelland. James. What learning systems do intelligent agents need? Trends in Cognitive Sciences, 2016.<br />
<br />
* <sup>[3]</sup>Goodfellow. Ian, Warde-Farley. David, Mirza. Mehdi, Courville. Aaron, and Bengio. Yoshua. Maxout networks. arXiv preprint, 2013.<br />
<br />
* <sup>[4]</sup>Russakovsky. Olga, Deng. Jia, Su. Hao, Krause. Jonathan, Satheesh. Sanjeev, Ma. Sean, Huang. Zhiheng, Karpathy. Andrej, Khosla. Aditya, and Bernstein. Michael. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.<br />
<br />
* <sup>[5]</sup>He. Kaiming, Zhang. Xiangyu, Ren. Shaoqing, and Sun. Jian. Deep residual learning for image recognition. IEEE conference on computer vision and pattern recognition, 2016.<br />
<br />
* <sup>[6]</sup>Pritzel. Alexander, Uria. Benigno, Srinivasan. Sriram, Puigdomenech. Adria, Vinyals. Oriol, Hassabis. Demis, Wierstra. Daan, and Blundell. Charles. Neural episodic control. ICML, 2017.</div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory-Based_Parameter_Adaptation&diff=38598Memory-Based Parameter Adaptation2018-11-10T03:47:10Z<p>As2na: /* Testing Phase */</p>
<hr />
<div>This is a summary based on the paper, Memory-based Parameter Adaptation by Sprechmann et al.<sup>[[#References|[1]]]</sup>.<br />
<br />
The paper generalizes several approaches from language modelling that use memory-based adaptation to overcome shortcomings of neural networks, including catastrophic forgetting. Catastrophic forgetting occurs when a neural network performs poorly on old tasks after it has been trained to perform well on a new task. The paper also presents experimental results in which the model is applied to continual and incremental learning tasks.<br />
<br />
= Presented by = <br />
*J.Walton<br />
*J.Schneider<br />
*Z.Abbas<br />
*A.Na<br />
<br />
= Introduction = <br />
<br />
Memory-based parameter adaptation (MbPA) is based on the theory of complementary learning systems, which states that intelligent agents must possess two learning systems: one that allows the gradual acquisition of knowledge and another that allows rapid learning of the specifics of individual experiences<sup>[[#References|[2]]]</sup>. Similarly, MbPA consists of two components: a parametric component and a non-parametric component. The parametric component is a standard neural network that learns slowly (low learning rates) but generalizes well. The non-parametric component, on the other hand, is an episodic memory that stores previous experiences and is used to locally adapt the weights of the parametric component. The parametric and non-parametric components therefore serve different purposes during the training and testing phases.<br />
<br />
= Model Architecture = <br />
[[File:MbPA_model_architecture.PNG|700px|thumb|center|Architecture for the MbPA model. Left: Training Usage. Right: Testing Setting.]]<br />
<br />
== Training Phase == <br />
<br />
The model consists of three components: an embedding network <math>f_{\gamma}</math>, a memory <math>M</math> and an output network <math>g_{\theta}</math>. The embedding network and the output network can be thought of as the standard feedforward neural networks for our purposes, with parameters (weights) <math>\gamma</math> and <math>\theta</math>, respectively. The memory, denoted by <math>M</math>, stores “experiences” in the form of key and value pairs <math>\{(h_{i},v_{i})\}</math> where the keys <math>h_{i}</math> are the outputs of the embedding network <math>f_{\gamma}(x_{i})</math> and the values <math>v_{i}</math>, in the context of classification, are simply the true class labels <math>y_{i}</math>. Thus, for a given input <math>x_{j}</math><br />
<br />
<center><br />
<math><br />
f_{\gamma}(x_{j}) \rightarrow h_{j},<br />
</math><br />
</center><br />
<br />
<center><br />
<math><br />
y_{j} \rightarrow v_{j}.<br />
</math><br />
</center> <br />
<br />
Note that the memory has a fixed size; thus when it is full, the oldest data is discarded first.<br />
<br />
During training, the authors randomly sample a set of <math>b</math> training examples (i.e. a mini-batch of size <math>b</math>), say <math>\{(x_{b},y_{b})\}_{b}</math>, from the training data and pass them through the embedding network <math>f_{\gamma}</math>, followed by the output network <math>g_{\theta}</math>. The parameters of the embedding and output networks are updated by maximizing the likelihood function (equivalently, minimizing the negative log-likelihood loss) of the target values<br />
<br />
<center><br />
<math><br />
p(y|x,\gamma,\theta)=g_{\theta}(f_{\gamma}(x)).<br />
</math><br />
</center><br />
<br />
The last layer of the output network <math>g_{\theta}</math> is a softmax layer, such that the output can be interpreted as a probability distribution. This process is also known as backpropagation with mini-batch gradient descent. Finally, the embedded samples <math>\{(f_{\gamma}(x_{b}),y_{b})\}_{b}</math> are stored into the memory. No local adaptation takes place during this phase.<br />
<br />
== Testing Phase ==<br />
During the testing phase, the model will temporarily adapt the weights of the output network <math>g_{\theta}</math> based on the input <math>x</math> and the contents of the memory, <math>M</math>, according to<br />
<center><br />
<math><br />
\theta^x = \theta + \Delta_M.<br />
</math><br />
</center><br />
First, <math>x</math> is passed through the embedding network to obtain the query <math>q = f_{\gamma}(x)</math>. Based on the query <math>q</math>, a K-nearest neighbours search is conducted over the keys in memory. The context, <math>C</math>, is the result of this search.<br />
<center><br />
<math><br />
C = \{(h_k, v_k, w_k^{(x)})\}^K_{k=1}<br />
</math><br />
</center><br />
Each of the neighbours has a weighting <math>w_k^{(x)}</math> attached to it, based on how close it is to query <math>q</math>. This calculation is based on the kernel function,<br />
<center><br />
<math><br />
kern(h,q) = \frac{1}{\epsilon + ||h-q||^2_2}.<br />
</math><br />
</center><br />
The temporary updates during adaptation are obtained by maximizing the weighted sum of the log-likelihoods over the neighbours in <math>C</math> together with a log-prior term on the adapted weights, i.e. a maximum a posteriori estimate over the context <math>C</math>,<br />
<center><br />
<math><br />
\max_{\theta^x} \log p(\theta^x | \theta) + \sum^K_{k=1}w_k^{(x)} \log p(v^{(x)}_k | h_k^{(x)}, \theta^x,x). <br />
</math><br />
</center><br />
Note that the first term acts as a regularizer that prevents over-fitting. Unfortunately, the maximization above does not have a closed-form solution. However, it can be approximately solved using a fixed number of gradient steps. Each of these steps is calculated via <math>\Delta_M</math>,<br />
<center><br />
<math><br />
\Delta_M (x, \theta) = - \alpha_M \nabla_\theta \sum^K_{k=1} w_k^{(x)} \log p(v^{(x)}_k | h_k^{(x)}, \theta^x,x)\bigg |_\theta - \beta(\theta - \theta^x), <br />
</math><br />
</center><br />
where <math>\beta</math> is a hyper-parameter of gradient descent. After a series of gradient descent steps, the weights of the final output network <math>g_{\theta}</math> are temporarily adapted and a prediction is made, <math>\hat y</math>.<br />
<br />
[[File:Figure2.PNG|700px|thumb|center|Local fitting on a regression task given a query (blue) and the context from memory (red).<sup>[[#References|[1]]]</sup>.]]<br />
<br />
As can be seen in figure 2, the final prediction <math>\hat y</math> is similar to a weighted average of the values of the K-nearest neighbours.<br />
<br />
= Examples =<br />
<br />
== Continual Learning ==<br />
Continual learning is the process of learning multiple tasks in a sequence without revisiting a task. The authors consider a permuted MNIST setup, similar to [[#References|[3]]], where each task was given by a different permutation of the pixels. The authors sequentially trained the MbPA on 20 different permutations and tested on previously trained tasks.<br />
<br />
The model was trained on 10 000 examples per task, using a two-layer multi-layer perceptron (MLP) with the Adam optimizer. The elastic weight consolidation (EWC) method and regular gradient descent were used to estimate the parameters. A grid search was used to determine the EWC penalty cost; the local MbPA learning rate was set as <math>\beta\in(0.0,0.1)</math> and the number of steps as <math>n\in[1,20]</math>.<br />
<br />
[[File:ContinualLearning.PNG|700px|thumb|center|Results on baseline comparisons on permuted MNIST<br />
with MbPA using different memory sizes.]]<br />
<br />
The authors used the pixels as the embedding, i.e. <math>f_{\gamma}</math> is the identity function, and looked at regimes where the episodic memory was small. They found that, with MbPA, only a few gradient steps on carefully selected data from memory are enough to recover performance. MbPA outperformed the MLP and worked better than EWC in most cases, and its performance grew with the number of examples stored. They note that the memory requirements were lower than those of EWC. The lower memory requirements are attributed to the fact that EWC stores all task identifiers, whereas MbPA only stores a few examples. The figure above also shows the results of MbPA combined with other methods. MbPA combined with EWC gives the best results.<br />
<br />
== Incremental Learning ==<br />
<br />
Incremental learning has two steps. First, the model is trained on a subset of the classes found in the training data. The second step is to give it the entire training set and see how long it takes for the model to perform well on the entire set. The purpose of this is to see how quickly the model learns information about new classes and how likely it is to lose information about the old ones. The authors used the ImageNet dataset from [[#References|[4]]], and the initial training set contained 500 out of the 1000 classes.<br />
<br />
For the first step, they used three models: a parametric model, MbPA, and a mixture model. The parametric model was ResNet V1 from [[#References|[5]]]. It was used both as the parametric component of MbPA and as a separate model for testing. The non-parametric model was the memory as described earlier. The keys in the memory were taken from the second-to-last layer of the parametric model. The mixture model was a convex combination of the outputs of the parametric and non-parametric models, as shown below:<br />
<br />
<center><br />
<math><br />
p(y|q) = \lambda p_{param}(y|q) + (1-\lambda)p_{mem}(y|q).<br />
</math><br />
</center><br />
<br />
<math>\lambda</math> was tuned as a hyperparameter. Finally, MbPA used the ResNet V1 parametric model together with the same memory as the non-parametric model described above. The models were evaluated using their “Top 1” accuracy; that is, the class with the highest output value was taken to be the model’s prediction for a given data point in the test set.<br />
<br />
[[File:Figure4.PNG|700px|thumb|center|All three models perform similarly on the data they were pre-trained on. On the new classes, the mixture and parametric models perform similarly and MbPA performs much better<sup>[[#References|[1]]]</sup>.]]<br />
<br />
There was also a test of how well the models perform on unbalanced datasets. In addition to the previous three, they included a non-parametric model which was just the memory running without the rest of the network. Since most real-world datasets have different amounts of data in each class, a model that could use unbalanced datasets without becoming biased would have more information available to it for training. The testing here was done similarly to the other incremental learning experiment. The models were trained on 500 of the 1000 classes until they performed well. They were then given a dataset containing all of the data from the first 500 classes and only 10% of the data from the other 500 classes. Performance was evaluated using both Top 1 accuracy and AUC (area under the curve). It was found that after 0.1 epochs, MbPA and the non-parametric model performed similarly and much better than the other two by both metrics. After 1 or 3 epochs, the non-parametric model began to perform worse than the others and MbPA continued to perform better.<br />
<br />
= Conclusion =<br />
<br />
The MbPA model can successfully overcome several shortcomings associated with neural networks through its non-parametric, episodic memory. In fact, many other works in the context of classification and language modelling have successfully used variants of this architecture, where traditional neural network systems are augmented with memories. Likewise, the experiments in incremental and continual learning presented in this paper use a memory architecture similar to the Differential Neural Dictionary (DND) used in Neural Episodic Control (NEC) found in [[#References|[6]]], though the gradients from the memory in the MbPA model are not used during training. In conclusion, MbPA presents a natural way to improve the performance of standard deep networks.<br />
<br />
=References=<br />
* <sup>[1]</sup>Sprechmann. Pablo, Jayakumar. Siddhant, Rae. Jack, Pritzel. Alexander, Badia. Adria, Uria. Benigno, Vinyals. Oriol, Hassabis. Demis, Pascanu. Razvan, and Blundell. Charles. Memory-based parameter adaptation. ICLR, 2018.<br />
<br />
* <sup>[2]</sup>Kumaran. Dhushan, Hassabis. Demis, and McClelland. James. What learning systems do intelligent agents need? Trends in Cognitive Sciences, 2016.<br />
<br />
* <sup>[3]</sup>Goodfellow. Ian, Warde-Farley. David, Mirza. Mehdi, Courville. Aaron, and Bengio. Yoshua. Maxout networks. arXiv preprint, 2013.<br />
<br />
* <sup>[4]</sup>Russakovsky. Olga, Deng. Jia, Su. Hao, Krause. Jonathan, Satheesh. Sanjeev, Ma. Sean, Huang. Zhiheng, Karpathy. Andrej, Khosla. Aditya, and Bernstein. Michael. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.<br />
<br />
* <sup>[5]</sup>He. Kaiming, Zhang. Xiangyu, Ren. Shaoqing, and Sun. Jian. Deep residual learning for image recognition. IEEE conference on computer vision and pattern recognition, 2016.<br />
<br />
* <sup>[6]</sup>Pritzel. Alexander, Uria. Benigno, Srinivasan. Sriram, Puigdomenech. Adria, Vinyals. Oriol, Hassabis. Demis, Wierstra. Daan, and Blundell. Charles. Neural episodic control. ICML, 2017.</div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory-Based_Parameter_Adaptation&diff=38597Memory-Based Parameter Adaptation2018-11-10T03:45:13Z<p>As2na: /* Testing Phase */</p>
<hr />
<div>This is a summary based on the paper, Memory-based Parameter Adaptation by Sprechmann et al.<sup>[[#References|[1]]]</sup>.<br />
<br />
The paper generalizes several approaches from language modelling that use memory-based adaptation to overcome shortcomings of neural networks, including catastrophic forgetting. Catastrophic forgetting occurs when a neural network performs poorly on old tasks after it has been trained to perform well on a new task. The paper also presents experimental results in which the model is applied to continual and incremental learning tasks.<br />
<br />
= Presented by = <br />
*J.Walton<br />
*J.Schneider<br />
*Z.Abbas<br />
*A.Na<br />
<br />
= Introduction = <br />
<br />
Memory-based parameter adaptation (MbPA) is based on the theory of complementary learning systems, which states that intelligent agents must possess two learning systems: one that allows the gradual acquisition of knowledge and another that allows rapid learning of the specifics of individual experiences<sup>[[#References|[2]]]</sup>. Similarly, MbPA consists of two components: a parametric component and a non-parametric component. The parametric component is a standard neural network that learns slowly (low learning rates) but generalizes well. The non-parametric component, on the other hand, is an episodic memory that stores previous experiences and is used to locally adapt the weights of the parametric component. The parametric and non-parametric components therefore serve different purposes during the training and testing phases.<br />
<br />
= Model Architecture = <br />
[[File:MbPA_model_architecture.PNG|700px|thumb|center|Architecture for the MbPA model. Left: Training Usage. Right: Testing Setting.]]<br />
<br />
== Training Phase == <br />
<br />
The model consists of three components: an embedding network <math>f_{\gamma}</math>, a memory <math>M</math> and an output network <math>g_{\theta}</math>. The embedding network and the output network can be thought of as the standard feedforward neural networks for our purposes, with parameters (weights) <math>\gamma</math> and <math>\theta</math>, respectively. The memory, denoted by <math>M</math>, stores “experiences” in the form of key and value pairs <math>\{(h_{i},v_{i})\}</math> where the keys <math>h_{i}</math> are the outputs of the embedding network <math>f_{\gamma}(x_{i})</math> and the values <math>v_{i}</math>, in the context of classification, are simply the true class labels <math>y_{i}</math>. Thus, for a given input <math>x_{j}</math><br />
<br />
<center><br />
<math><br />
f_{\gamma}(x_{j}) \rightarrow h_{j},<br />
</math><br />
</center><br />
<br />
<center><br />
<math><br />
y_{j} \rightarrow v_{j}.<br />
</math><br />
</center> <br />
<br />
Note that the memory has a fixed size; thus when it is full, the oldest data is discarded first.<br />
<br />
During training, the authors randomly sample a set of <math>b</math> training examples (i.e. a mini-batch of size <math>b</math>), say <math>\{(x_{b},y_{b})\}_{b}</math>, from the training data and pass them through the embedding network <math>f_{\gamma}</math>, followed by the output network <math>g_{\theta}</math>. The parameters of the embedding and output networks are updated by maximizing the likelihood function (equivalently, minimizing the negative log-likelihood loss) of the target values<br />
<br />
<center><br />
<math><br />
p(y|x,\gamma,\theta)=g_{\theta}(f_{\gamma}(x)).<br />
</math><br />
</center><br />
<br />
The last layer of the output network <math>g_{\theta}</math> is a softmax layer, such that the output can be interpreted as a probability distribution. This process is also known as backpropagation with mini-batch gradient descent. Finally, the embedded samples <math>\{(f_{\gamma}(x_{b}),y_{b})\}_{b}</math> are stored into the memory. No local adaptation takes place during this phase.<br />
<br />
== Testing Phase ==<br />
During the testing phase, the model will temporarily adapt the weights of the output network <math>g_{\theta}</math> based on the input <math>x</math> and the contents of the memory, <math>M</math>, according to<br />
<center><br />
<math><br />
\theta^x = \theta + \Delta_M.<br />
</math><br />
</center><br />
First, <math>x</math> is passed through the embedding network to obtain the query <math>q = f_{\gamma}(x)</math>. Based on the query <math>q</math>, a K-nearest neighbours search is conducted over the keys in memory. The context, <math>C</math>, is the result of this search.<br />
<center><br />
<math><br />
C = \{(h_k, v_k, w_k^{(x)})\}^K_{k=1}<br />
</math><br />
</center><br />
Each of the neighbours has a weighting <math>w_k^{(x)}</math> attached to it, based on how close it is to query <math>q</math>. This calculation is based on the kernel function,<br />
<center><br />
<math><br />
kern(h,q) = \frac{1}{\epsilon + ||h-q||^2_2}.<br />
</math><br />
</center><br />
The temporary updates during adaptation are obtained by maximizing the weighted sum of the log-likelihoods over the neighbours in <math>C</math> together with a log-prior term on the adapted weights, i.e. a maximum a posteriori estimate over the context <math>C</math>,<br />
<center><br />
<math><br />
\max_{\theta^x} \log p(\theta^x | \theta) + \sum^K_{k=1}w_k^{(x)} \log p(v^{(x)}_k | h_k^{(x)}, \theta^x,x). <br />
</math><br />
</center><br />
Note that the first term acts as a regularizer that prevents over-fitting. Unfortunately, the maximization above does not have a closed-form solution. However, it can be approximately solved using a fixed number of gradient steps. Each of these steps is calculated via <math>\Delta_M</math>,<br />
<center><br />
<math><br />
\Delta_M (x, \theta) = - \alpha_M \nabla_\theta \sum^K_{k=1} w_k^{(x)} \log p(v^{(x)}_k | h_k^{(x)}, \theta^x,x)\bigg |_\theta - \beta(\theta - \theta^x), <br />
</math><br />
</center><br />
where <math>\beta</math> is a hyper-parameter of gradient descent. After a series of gradient descent steps, the weights of the final output network <math>g_{\theta}</math> are temporarily adapted and a prediction is made, <math>\hat y</math>.<br />
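<br />
To make the local adaptation loop concrete, here is a minimal sketch for the special case where the output network is a single linear layer followed by a softmax. Several things here are assumptions rather than the authors' implementation: the linear output layer, the function names, the default step size, and the choice to write the regularizer as a quadratic pull of <math>\theta^x</math> towards <math>\theta</math> (one common reading of the <math>\log p(\theta^x | \theta)</math> term).<br />

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, h):
    """Class probabilities of a linear-softmax output layer (assumed form)."""
    return softmax([sum(w * x for w, x in zip(row, h)) for row in W])


def adapt(W, context, alpha=0.5, beta=0.1, steps=10):
    """Return temporarily adapted weights; W itself is left untouched.

    `context` holds (key, label, weight) triples. Each step ascends the
    weighted log likelihood of the neighbours and pulls the adapted
    weights back towards W with strength beta.
    """
    Wx = [row[:] for row in W]
    for _ in range(steps):
        grad = [[0.0] * len(row) for row in W]
        for h, v, w in context:
            p = predict(Wx, h)
            for c in range(len(Wx)):
                # d/dW[c] of log p(v | h) for a softmax layer.
                coef = w * ((1.0 if c == v else 0.0) - p[c])
                for j, hj in enumerate(h):
                    grad[c][j] += coef * hj
        for c in range(len(Wx)):
            for j in range(len(Wx[c])):
                grad[c][j] -= beta * (Wx[c][j] - W[c][j])
                Wx[c][j] += alpha * grad[c][j]
    return Wx


# Hypothetical example: two classes, two neighbours that both carry label 1.
W = [[0.0, 0.0], [0.0, 0.0]]
context = [([1.0, 0.0], 1, 0.6), ([0.9, 0.1], 1, 0.4)]
W_adapted = adapt(W, context)
y_hat = predict(W_adapted, [1.0, 0.0])
```

Because <code>adapt</code> works on a copy, the original weights are available again for the next query, which mirrors the temporary nature of the update described above.<br />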
<br />
[[File:Figure2.PNG|700px|thumb|center|Local fitting on a regression task given a query (blue) and the context from memory (red).<sup>[[#References|[1]]]</sup>.]]<br />
<br />
As can be seen in figure 2, the final prediction <math>\hat y</math> is similar to a weighted average of the values of the K-nearest neighbours.<br />
<br />
= Examples =<br />
<br />
== Continual Learning ==<br />
Continual learning is the process of learning multiple tasks in a sequence without revisiting a task. The authors consider a permuted MNIST setup, similar to [[#References|[3]]], where each task was given by a different permutation of the pixels. The authors sequentially trained the MbPA on 20 different permutations and tested on previously trained tasks.<br />
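<br />
As an illustration of the permuted MNIST setup, the sketch below builds one fixed pixel permutation per task and applies it to a flattened image. The function names and the seed are hypothetical, and the image is represented as a plain list of 784 pixel values rather than loaded from the actual dataset.<br />

```python
import random


def make_task_permutations(num_tasks, num_pixels, seed=0):
    """Create one fixed random pixel permutation per task."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(num_tasks):
        perm = list(range(num_pixels))
        rng.shuffle(perm)
        permutations.append(perm)
    return permutations


def apply_permutation(image, perm):
    """Reorder the pixels of a flattened image according to perm."""
    return [image[i] for i in perm]


# 20 tasks over 28 x 28 = 784 pixels, as in the experiment described above.
tasks = make_task_permutations(num_tasks=20, num_pixels=784)
dummy_image = [float(i) for i in range(784)]  # stand-in for a real digit
task3_view = apply_permutation(dummy_image, tasks[3])
```

Every image in a given task is shuffled with the same permutation, so each task has the same difficulty but a different input layout.<br />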
<br />
The model was trained on 10,000 examples per task, using a two-layer multi-layer perceptron (MLP) with the Adam optimizer. The elastic weight consolidation (EWC) method and regular gradient descent were used to estimate the parameters. A grid search was used to determine the EWC penalty cost; the local MbPA learning rate was set as <math>\beta\in(0.0,0.1)</math> and the number of steps was <math>n\in[1,20]</math>.<br />
<br />
[[File:ContinualLearning.PNG|700px|thumb|center|Results on baseline comparisons on permuted MNIST<br />
with MbPA using different memory sizes.]]<br />
<br />
The authors used the pixels as the embedding, i.e. <math>f_{\gamma}</math> is the identity function, and looked at regimes where the episodic memory was small. They found that, with MbPA, only a few gradient steps on carefully selected data from memory are enough to recover performance. MbPA outperformed the MLP, worked better than EWC in most cases, and improved as the number of stored examples grew. The authors note that the memory requirements were lower than those of EWC, which they attribute to the fact that EWC stores all task identifiers, whereas MbPA only stores a few examples. The figure above also shows the results of MbPA combined with other methods; MbPA combined with EWC gives the best results.<br />
<br />
== Incremental Learning ==<br />
<br />
Incremental learning has two steps. First, the model is trained on a subset of the classes found in the training data. The second step is to give it the entire training set and see how long it takes for the model to perform well on the entire set. The purpose of this is to see how quickly the model learns information about new classes and how likely it is to lose information about the old ones. The authors used the ImageNet dataset from [[#References|[4]]], and the initial training set contained 500 out of the 1000 classes.<br />
<br />
For the first step, they used three models: a parametric model, a mixture model, and MbPA. The parametric model was ResNet V1 from [[#References|[5]]]. It was used both as the parametric model in MbPA and as a separate model for testing. The non-parametric model was the memory as described earlier, with keys taken from the second-last layer of the parametric model. The mixture model was a convex combination of the outputs of the parametric and non-parametric models, as shown below:<br />
<br />
<center><br />
<math><br />
p(y|q) = \lambda p_{param}(y|q) + (1-\lambda)p_{mem}(y|q).<br />
</math><br />
</center><br />
<br />
<math>\lambda</math> was tuned as a hyperparameter. Finally, MbPA was used as the third model, with ResNet V1 as the parametric model and the same non-parametric memory as described above. The models were evaluated using their “Top 1” accuracy; that is, the class with the highest output value was taken to be the model’s prediction for a given data point in the test set.<br />
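<br />
The convex combination and the “Top 1” decision rule described above can be sketched as follows. The two input distributions are assumed to be given as lists of class probabilities; how <math>p_{mem}</math> is computed from the memory is not specified in this summary, so it is simply treated as an input, and the function names are hypothetical.<br />

```python
def mixture(p_param, p_mem, lam):
    """Convex combination lam * p_param + (1 - lam) * p_mem."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(p_param, p_mem)]


def top1(probs):
    """Index of the class with the highest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical two-class example in which the two models disagree.
p = mixture([0.7, 0.3], [0.2, 0.8], lam=0.5)
predicted_class = top1(p)
```

Since both inputs sum to one and <math>\lambda</math> lies in <math>[0,1]</math>, the mixture is again a valid probability distribution.<br />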
<br />
[[File:Figure4.PNG|700px|thumb|center|All three models perform similarly on the data they were pre-trained on. On the new classes, the mixture and parametric models perform similarly and MbPA performs much better<sup>[[#References|[1]]]</sup>.]]<br />
<br />
There was also a test on how well the models perform on unbalanced datasets. In addition to the previous three, they included a non-parametric model which was just the memory running without the rest of the network. Since most real-world datasets have different amounts of data in each class, a model that could use unbalanced datasets without becoming biased would have more information available to it for training. The testing here was done similarly to the other incremental learning experiment. The models were trained on 500 of the 1000 classes until they performed well. They were then given a dataset containing all of the data from the first 500 classes and only 10% of the data from the other 500 classes. Accuracy was evaluated both using Top 1 and AUC (area under the curve) accuracy. It was found that after 0.1 epochs, MbPA and the non-parametric model performed similarly and much better than the other two by both accuracy metrics. After 1 or 3 epochs, the non-parametric model begins to perform worse than the others and MbPA continues to perform better.<br />
<br />
= Conclusion =<br />
<br />
The MbPA model can successfully overcome several shortcomings associated with neural networks through its non-parametric, episodic memory. In fact, many other works in the context of classification and language modelling have successfully used variants of this architecture, where traditional neural network systems are augmented with memories. Likewise, the experiments in incremental and continual learning presented in this paper use a memory architecture similar to the Differential Neural Dictionary (DND) used in Neural Episodic Control (NEC) found in [[#References|[6]]], though the gradients from the memory in the MbPA model are not used during training. In conclusion, MbPA presents a natural way to improve the performance of standard deep networks.<br />
<br />
=References=<br />
* <sup>[1]</sup> Sprechmann. Pablo, Jayakumar. Siddhant, Rae. Jack, Pritzel. Alexander, Badia. Adria, Uria. Benigno, Vinyals. Oriol, Hassabis. Demis, Pascanu. Razvan, and Blundell. Charles. Memory-based parameter adaptation. ICLR, 2018.<br />
<br />
* <sup>[2]</sup> Kumaran. Dhushan, Hassabis. Demis, and McClelland. James. What learning systems do intelligent agents need? Trends in Cognitive Sciences, 2016.<br />
<br />
* <sup>[3]</sup> Goodfellow. Ian, Warde-Farley. David, Mirza. Mehdi, Courville. Aaron, and Bengio. Yoshua. Maxout networks. arXiv preprint, 2013.<br />
<br />
* <sup>[4]</sup> Russakovsky. Olga, Deng. Jia, Su. Hao, Krause. Jonathan, Satheesh. Sanjeev, Ma. Sean, Huang. Zhiheng, Karpathy. Andrej, Khosla. Aditya, and Bernstein. Michael. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.<br />
<br />
* <sup>[5]</sup> He. Kaiming, Zhang. Xiangyu, Ren. Shaoqing, and Sun. Jian. Deep residual learning for image recognition. IEEE conference on computer vision and pattern recognition, 2016.<br />
<br />
* <sup>[6]</sup> Pritzel. Alexander, Uria. Benigno, Srinivasan. Sriram, Puigdomenech. Adria, Vinyals. Oriol, Hassabis. Demis, Wierstra. Daan, and Blundell. Charles. Neural episodic control. ICML, 2017.</div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory-Based_Parameter_Adaptation&diff=38596Memory-Based Parameter Adaptation2018-11-10T03:42:02Z<p>As2na: /* Testing Phase */</p>
<hr />
<div>This is a summary based on the paper, Memory-based Parameter Adaptation by Sprechmann et al.<sup>[[#References|[1]]]</sup>.<br />
<br />
The paper generalizes some approaches in language modelling that seek to overcome some of the shortcomings of neural networks including the phenomenon of catastrophic forgetting using memory-based adaptation. Catastrophic forgetting occurs when neural networks perform poorly on old tasks after they have been trained to perform well on a new task. The paper also presents experimental results where the model in question is applied to continual and incremental learning tasks.<br />
<br />
= Presented by = <br />
*J.Walton<br />
*J.Schneider<br />
*Z.Abbas<br />
*A.Na<br />
<br />
= Introduction = <br />
<br />
Memory-based parameter adaptation (MbPA) is based on the theory of complementary learning systems, which states that intelligent agents must possess two learning systems: one that allows the gradual acquisition of knowledge and another that allows rapid learning of the specifics of individual experiences<sup>[[#References|[2]]]</sup>. Similarly, MbPA consists of two components: a parametric component and a non-parametric component. The parametric component is the standard neural network which learns slowly (low learning rates) but generalizes well. The non-parametric component, on the other hand, is a neural network augmented with an episodic memory that allows storing of previous experiences and local adaptation of the weights of the parametric component. The parametric and non-parametric components therefore serve different purposes during the training and testing phases.<br />
<br />
= Model Architecture = <br />
[[File:MbPA_model_architecture.PNG|700px|thumb|center|Architecture for the MbPA model. Left: Training Usage. Right: Testing Setting.]]<br />
<br />
== Training Phase == <br />
<br />
The model consists of three components: an embedding network <math>f_{\gamma}</math>, a memory <math>M</math> and an output network <math>g_{\theta}</math>. The embedding network and the output network can be thought of as the standard feedforward neural networks for our purposes, with parameters (weights) <math>\gamma</math> and <math>\theta</math>, respectively. The memory, denoted by <math>M</math>, stores “experiences” in the form of key and value pairs <math>\{(h_{i},v_{i})\}</math> where the keys <math>h_{i}</math> are the outputs of the embedding network <math>f_{\gamma}(x_{i})</math> and the values <math>v_{i}</math>, in the context of classification, are simply the true class labels <math>y_{i}</math>. Thus, for a given input <math>x_{j}</math><br />
<br />
<center><br />
<math><br />
f_{\gamma}(x_{j}) \rightarrow h_{j},<br />
</math><br />
</center><br />
<br />
<center><br />
<math><br />
y_{j} \rightarrow v_{j}.<br />
</math><br />
</center> <br />
<br />
Note that the memory has a fixed size; thus when it is full, the oldest data is discarded first.<br />
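<br />
The fixed-size, oldest-first eviction policy just described behaves like a bounded first-in-first-out buffer. The sketch below shows one way to express it with Python's <code>collections.deque</code>; the class name <code>EpisodicMemory</code> and its methods are hypothetical and chosen only for illustration.<br />

```python
from collections import deque


class EpisodicMemory:
    """Fixed-capacity store of (key, value) pairs with oldest-first eviction."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self._items = deque(maxlen=capacity)

    def write(self, key, value):
        self._items.append((tuple(key), value))

    def items(self):
        return list(self._items)

    def __len__(self):
        return len(self._items)


# Hypothetical usage: capacity 2, three writes, so the first pair is evicted.
memory = EpisodicMemory(capacity=2)
memory.write([0.1, 0.2], 0)
memory.write([0.3, 0.4], 1)
memory.write([0.5, 0.6], 2)
```

After the third write, only the two most recent key and value pairs remain in the memory.<br />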
<br />
During training, the authors randomly sample a set of <math>b</math> training examples (i.e. a mini-batch of size <math>b</math>), say <math>\{(x_{b},y_{b})\}_{b}</math>, from the training data and input them into the embedding network <math>f_{\gamma}</math>, followed by the output network <math>g_{\theta}</math>. The parameters of the embedding and output networks are updated by maximizing the likelihood function (equivalently, minimizing the loss function) of the target values<br />
<br />
<center><br />
<math><br />
p(y|x,\gamma,\theta)=g_{\theta}(f_{\gamma}(x)).<br />
</math><br />
</center><br />
<br />
The last layer of the output network <math>g_{\theta}</math> is a softmax layer, such that the output can be interpreted as a probability distribution. This process is also known as backpropagation with mini-batch gradient descent. Finally, the embedded samples <math>\{(f_{\gamma}(x_{b}),y_{b})\}_{b}</math> are stored into the memory. No local adaptation takes place during this phase.<br />
<br />
== Testing Phase ==<br />
During the testing phase, the model will temporarily adapt the weights of the output network <math>g_{\theta}</math> based on the input <math>x</math> and the contents of the memory, <math>M</math>, according to<br />
<center><br />
<math><br />
\theta^x = \theta + \Delta_M.<br />
</math><br />
</center><br />
First, <math>x</math> is fed into the embedding network to obtain the query <math>q = f_{\gamma}(x)</math>. Based on the query <math>q</math>, a K-nearest neighbours search is conducted over the keys stored in memory. The result of this search is the context, <math>C</math>:<br />
<center><br />
<math><br />
C = \{(h_k, v_k, w_k^{(x)})\}^K_{k=1}<br />
</math><br />
</center><br />
Each of the neighbours has a weighting <math>w_k^{(x)}</math> attached to it, based on how close it is to query <math>q</math>. This calculation is based on the kernel function,<br />
<center><br />
<math><br />
kern(h,q) = \frac{1}{\epsilon + ||h-q||^2_2}.<br />
</math><br />
</center><br />
<br />
The temporary updates during adaptation are obtained by maximizing the weighted average of the log likelihood over the neighbours in <math>C</math>, which corresponds to maximum a posteriori estimation over the context <math>C</math>,<br />
<br />
<center><br />
<math><br />
\max_{\theta^x} \log p(\theta^x | \theta) + \sum^K_{k=1}w_k^{(x)} \log p(v^{(x)}_k | h_k^{(x)}, \theta^x,x). <br />
</math><br />
</center><br />
<br />
Note that the first term acts as a regularizer that prevents over-fitting. Unfortunately, this objective does not have a closed-form solution. However, it can be optimized with gradient-based updates in a fixed number of steps. Each of these steps is calculated via <math>\Delta_M</math>,<br />
<br />
<center><br />
<math><br />
\Delta_M (x, \theta) = - \alpha_M \nabla_\theta \sum^K_{k=1} w_k^{(x)} \log p(v^{(x)}_k | h_k^{(x)}, \theta^x,x)\bigg |_\theta - \beta(\theta - \theta^x), <br />
</math><br />
</center><br />
<br />
where <math>\beta</math> is a hyper-parameter of gradient descent. After a series of gradient descent steps, the weights of the final output network <math>g_{\theta}</math> are temporarily adapted and a prediction is made, <math>\hat y</math>.<br />
<br />
[[File:Figure2.PNG|700px|thumb|center|Local fitting on a regression task given a query (blue) and the context from memory (red).<sup>[[#References|[1]]]</sup>.]]<br />
<br />
As can be seen in figure 2, the final prediction <math>\hat y</math> is similar to a weighted average of the values of the K-nearest neighbours.<br />
<br />
= Examples =<br />
<br />
== Continual Learning ==<br />
Continual learning is the process of learning multiple tasks in a sequence without revisiting a task. The authors consider a permuted MNIST setup, similar to [[#References|[3]]], where each task was given by a different permutation of the pixels. The authors sequentially trained the MbPA on 20 different permutations and tested on previously trained tasks.<br />
<br />
The model was trained on 10,000 examples per task, using a two-layer multi-layer perceptron (MLP) with the Adam optimizer. The elastic weight consolidation (EWC) method and regular gradient descent were used to estimate the parameters. A grid search was used to determine the EWC penalty cost; the local MbPA learning rate was set as <math>\beta\in(0.0,0.1)</math> and the number of steps was <math>n\in[1,20]</math>.<br />
<br />
[[File:ContinualLearning.PNG|700px|thumb|center|Results on baseline comparisons on permuted MNIST<br />
with MbPA using different memory sizes.]]<br />
<br />
The authors used the pixels as the embedding, i.e. <math>f_{\gamma}</math> is the identity function, and looked at regimes where the episodic memory was small. They found that, with MbPA, only a few gradient steps on carefully selected data from memory are enough to recover performance. MbPA outperformed the MLP, worked better than EWC in most cases, and improved as the number of stored examples grew. The authors note that the memory requirements were lower than those of EWC, which they attribute to the fact that EWC stores all task identifiers, whereas MbPA only stores a few examples. The figure above also shows the results of MbPA combined with other methods; MbPA combined with EWC gives the best results.<br />
<br />
== Incremental Learning ==<br />
<br />
Incremental learning has two steps. First, the model is trained on a subset of the classes found in the training data. The second step is to give it the entire training set and see how long it takes for the model to perform well on the entire set. The purpose of this is to see how quickly the model learns information about new classes and how likely it is to lose information about the old ones. The authors used the ImageNet dataset from [[#References|[4]]], and the initial training set contained 500 out of the 1000 classes.<br />
<br />
For the first step, they used three models: a parametric model, a mixture model, and MbPA. The parametric model was ResNet V1 from [[#References|[5]]]. It was used both as the parametric model in MbPA and as a separate model for testing. The non-parametric model was the memory as described earlier, with keys taken from the second-last layer of the parametric model. The mixture model was a convex combination of the outputs of the parametric and non-parametric models, as shown below:<br />
<br />
<center><br />
<math><br />
p(y|q) = \lambda p_{param}(y|q) + (1-\lambda)p_{mem}(y|q).<br />
</math><br />
</center><br />
<br />
<math>\lambda</math> was tuned as a hyperparameter. Finally, MbPA was used as the third model, with ResNet V1 as the parametric model and the same non-parametric memory as described above. The models were evaluated using their “Top 1” accuracy; that is, the class with the highest output value was taken to be the model’s prediction for a given data point in the test set.<br />
<br />
[[File:Figure4.PNG|700px|thumb|center|All three models perform similarly on the data they were pre-trained on. On the new classes, the mixture and parametric models perform similarly and MbPA performs much better<sup>[[#References|[1]]]</sup>.]]<br />
<br />
There was also a test on how well the models perform on unbalanced datasets. In addition to the previous three, they included a non-parametric model which was just the memory running without the rest of the network. Since most real-world datasets have different amounts of data in each class, a model that could use unbalanced datasets without becoming biased would have more information available to it for training. The testing here was done similarly to the other incremental learning experiment. The models were trained on 500 of the 1000 classes until they performed well. They were then given a dataset containing all of the data from the first 500 classes and only 10% of the data from the other 500 classes. Accuracy was evaluated both using Top 1 and AUC (area under the curve) accuracy. It was found that after 0.1 epochs, MbPA and the non-parametric model performed similarly and much better than the other two by both accuracy metrics. After 1 or 3 epochs, the non-parametric model begins to perform worse than the others and MbPA continues to perform better.<br />
<br />
= Conclusion =<br />
<br />
The MbPA model can successfully overcome several shortcomings associated with neural networks through its non-parametric, episodic memory. In fact, many other works in the context of classification and language modelling have successfully used variants of this architecture, where traditional neural network systems are augmented with memories. Likewise, the experiments in incremental and continual learning presented in this paper use a memory architecture similar to the Differential Neural Dictionary (DND) used in Neural Episodic Control (NEC) found in [[#References|[6]]], though the gradients from the memory in the MbPA model are not used during training. In conclusion, MbPA presents a natural way to improve the performance of standard deep networks.<br />
<br />
=References=<br />
* <sup>[1]</sup> Sprechmann. Pablo, Jayakumar. Siddhant, Rae. Jack, Pritzel. Alexander, Badia. Adria, Uria. Benigno, Vinyals. Oriol, Hassabis. Demis, Pascanu. Razvan, and Blundell. Charles. Memory-based parameter adaptation. ICLR, 2018.<br />
<br />
* <sup>[2]</sup> Kumaran. Dhushan, Hassabis. Demis, and McClelland. James. What learning systems do intelligent agents need? Trends in Cognitive Sciences, 2016.<br />
<br />
* <sup>[3]</sup> Goodfellow. Ian, Warde-Farley. David, Mirza. Mehdi, Courville. Aaron, and Bengio. Yoshua. Maxout networks. arXiv preprint, 2013.<br />
<br />
* <sup>[4]</sup> Russakovsky. Olga, Deng. Jia, Su. Hao, Krause. Jonathan, Satheesh. Sanjeev, Ma. Sean, Huang. Zhiheng, Karpathy. Andrej, Khosla. Aditya, and Bernstein. Michael. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.<br />
<br />
* <sup>[5]</sup> He. Kaiming, Zhang. Xiangyu, Ren. Shaoqing, and Sun. Jian. Deep residual learning for image recognition. IEEE conference on computer vision and pattern recognition, 2016.<br />
<br />
* <sup>[6]</sup> Pritzel. Alexander, Uria. Benigno, Srinivasan. Sriram, Puigdomenech. Adria, Vinyals. Oriol, Hassabis. Demis, Wierstra. Daan, and Blundell. Charles. Neural episodic control. ICML, 2017.</div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory-Based_Parameter_Adaptation&diff=38583Memory-Based Parameter Adaptation2018-11-09T23:56:20Z<p>As2na: /* Conclusion */</p>
<hr />
<div>This is a summary based on the paper, Memory-based Parameter Adaptation by Sprechmann et al.<sup>[[#References|[1]]]</sup> <br />
<br />
The paper generalizes some approaches in language modelling that seek to overcome some of the shortcomings of neural networks including the phenomenon of catastrophic forgetting using memory-based adaptation. Catastrophic forgetting occurs when neural networks perform poorly on old tasks after they have been trained to perform well on a new task. The paper also presents experimental results where the model in question is applied to continual and incremental learning tasks.<br />
<br />
= Presented by = <br />
*J.Walton<br />
*J.Schneider<br />
*Z.Abbas<br />
*A.Na<br />
<br />
= Introduction = <br />
<br />
Memory-based parameter adaptation (MbPA) is based on the theory of complementary learning systems, which states that intelligent agents must possess two learning systems: one that allows the gradual acquisition of knowledge and another that allows rapid learning of the specifics of individual experiences<sup>[[#References|[2]]]</sup>. Similarly, MbPA consists of two components: a parametric component and a non-parametric component. The parametric component is the standard neural network which learns slowly (low learning rates) but generalizes well. The non-parametric component, on the other hand, is a neural network augmented with an episodic memory that allows storing of previous experiences and local adaptation of the weights of the parametric component. The parametric and non-parametric components therefore serve different purposes during the training and testing phases.<br />
<br />
= Model Architecture = <br />
[[File:MbPA_model_architecture.PNG|700px|thumb|center|Architecture for the MbPA model. Left: Training Usage. Right: Testing Setting.]]<br />
<br />
== Training Phase == <br />
<br />
The model consists of three components: an embedding network <math>f_{\gamma}</math>, a memory <math>M</math> and an output network <math>g_{\theta}</math>. The embedding network and the output network can be thought of as the standard feedforward neural networks for our purposes, with parameters (weights) <math>\gamma</math> and <math>\theta</math>, respectively. The memory, denoted by <math>M</math>, stores “experiences” in the form of key and value pairs <math>\{(h_{i},v_{i})\}</math> where the keys <math>h_{i}</math> are the outputs of the embedding network <math>f_{\gamma}(x_{i})</math> and the values <math>v_{i}</math>, in the context of classification, are simply the true class labels <math>y_{i}</math>. Thus, for a given input <math>x_{j}</math><br />
<br />
<center><br />
<math><br />
f_{\gamma}(x_{j}) \rightarrow h_{j},<br />
</math><br />
</center><br />
<br />
<center><br />
<math><br />
y_{j} \rightarrow v_{j}.<br />
</math><br />
</center> <br />
<br />
Note that the memory has a fixed size; thus when it is full, the oldest data is discarded first.<br />
<br />
During training, the authors randomly sample a set of <math>b</math> training examples (i.e. a mini-batch of size <math>b</math>), say <math>\{(x_{b},y_{b})\}_{b}</math>, from the training data and input them into the embedding network <math>f_{\gamma}</math>, followed by the output network <math>g_{\theta}</math>. The parameters of the embedding and output networks are updated by maximizing the likelihood function (equivalently, minimizing the loss function) of the target values<br />
<br />
<center><br />
<math><br />
p(y|x,\gamma,\theta)=g_{\theta}(f_{\gamma}(x)).<br />
</math><br />
</center><br />
<br />
The last layer of the output network <math>g_{\theta}</math> is a softmax layer, such that the output can be interpreted as a probability distribution. This process is also known as backpropagation with mini-batch gradient descent. Finally, the embedded samples <math>\{(f_{\gamma}(x_{b}),y_{b})\}_{b}</math> are stored into the memory. No local adaptation takes place during this phase.<br />
<br />
== Testing Phase == <br />
<br />
= Examples =<br />
<br />
== Continual Learning ==<br />
Continual learning is the process of learning multiple tasks in a sequence without revisiting a task. The authors consider a permuted MNIST setup, similar to [[#References|[3]]], where each task was given by a different permutation of the pixels. The authors sequentially trained the MbPA on 20 different permutations and tested on previously trained tasks.<br />
<br />
The model was trained on 10,000 examples per task, using a two-layer multi-layer perceptron (MLP) with the Adam optimizer. The elastic weight consolidation (EWC) method and regular gradient descent were used to estimate the parameters. A grid search was used to determine the EWC penalty cost; the local MbPA learning rate was set as <math>\beta\in(0.0,0.1)</math> and the number of steps was <math>n\in[1,20]</math>.<br />
<br />
[[File:ContinualLearning.PNG|700px|thumb|center|Results on baseline comparisons on permuted MNIST<br />
with MbPA using different memory sizes.]]<br />
<br />
The authors used the pixels as the embedding, i.e. <math>f_{\gamma}</math> is the identity function, and looked at regimes where the episodic memory was small. They found that, with MbPA, only a few gradient steps on carefully selected data from memory are enough to recover performance. MbPA outperformed the MLP, worked better than EWC in most cases, and improved as the number of stored examples grew. The authors note that the memory requirements were lower than those of EWC, which they attribute to the fact that EWC stores all task identifiers, whereas MbPA only stores a few examples. The figure above also shows the results of MbPA combined with other methods; MbPA combined with EWC gives the best results.<br />
<br />
== Incremental Learning ==<br />
<br />
= Conclusion =<br />
<br />
The MbPA model can successfully overcome several shortcomings associated with neural networks through its non-parametric, episodic memory. In fact, many other works in the context of classification and language modelling have successfully used variants of this architecture, where traditional neural network systems are augmented with memories. Likewise, the experiments in incremental and continual learning presented in this paper use a memory architecture similar to the Differential Neural Dictionary (DND) used in Neural Episodic Control (NEC) found in [[#References|[6]]], though the gradients from the memory in the MbPA model are not used during training. In conclusion, MbPA presents a natural way to improve the performance of standard deep networks.<br />
<br />
=References=<br />
* <sup>[1]</sup> Sprechmann. Pablo, Jayakumar. Siddhant, Rae. Jack, Pritzel. Alexander, Badia. Adria, Uria. Benigno, Vinyals. Oriol, Hassabis. Demis, Pascanu. Razvan, and Blundell. Charles. Memory-based parameter adaptation. ICLR, 2018.<br />
<br />
* <sup>[2]</sup> Kumaran. Dhushan, Hassabis. Demis, and McClelland. James. What learning systems do intelligent agents need? Trends in Cognitive Sciences, 2016.<br />
<br />
* <sup>[3]</sup> Goodfellow. Ian, Warde-Farley. David, Mirza. Mehdi, Courville. Aaron, and Bengio. Yoshua. Maxout networks. arXiv preprint, 2013.<br />
<br />
* <sup>[4]</sup> Russakovsky. Olga, Deng. Jia, Su. Hao, Krause. Jonathan, Satheesh. Sanjeev, Ma. Sean, Huang. Zhiheng, Karpathy. Andrej, Khosla. Aditya, and Bernstein. Michael. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.<br />
<br />
* <sup>[5]</sup> He. Kaiming, Zhang. Xiangyu, Ren. Shaoqing, and Sun. Jian. Deep residual learning for image recognition. IEEE conference on computer vision and pattern recognition, 2016.<br />
<br />
* <sup>[6]</sup> Pritzel. Alexander, Uria. Benigno, Srinivasan. Sriram, Puigdomenech. Adria, Vinyals. Oriol, Hassabis. Demis, Wierstra. Daan, and Blundell. Charles. Neural episodic control. ICML, 2017.</div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory-Based_Parameter_Adaptation&diff=38582Memory-Based Parameter Adaptation2018-11-09T23:53:37Z<p>As2na: /* Continual Learning */</p>
<hr />
<div>This is a summary based on the paper, Memory-based Parameter Adaptation by Sprechmann et al.<sup>[[#References|[1]]]</sup> <br />
<br />
The paper generalizes some approaches in language modelling that seek to overcome some of the shortcomings of neural networks including the phenomenon of catastrophic forgetting using memory-based adaptation. Catastrophic forgetting occurs when neural networks perform poorly on old tasks after they have been trained to perform well on a new task. The paper also presents experimental results where the model in question is applied to continual and incremental learning tasks.<br />
<br />
= Presented by = <br />
*J.Walton<br />
*J.Schneider<br />
*Z.Abbas<br />
*A.Na<br />
<br />
= Introduction = <br />
<br />
Memory-based parameter adaptation (MbPA) is based on the theory of complementary learning systems, which states that intelligent agents must possess two learning systems: one that allows the gradual acquisition of knowledge and another that allows rapid learning of the specifics of individual experiences<sup>[[#References|[2]]]</sup>. Similarly, MbPA consists of two components: a parametric component and a non-parametric component. The parametric component is the standard neural network which learns slowly (low learning rates) but generalizes well. The non-parametric component, on the other hand, is a neural network augmented with an episodic memory that allows storing of previous experiences and local adaptation of the weights of the parametric component. The parametric and non-parametric components therefore serve different purposes during the training and testing phases.<br />
<br />
= Model Architecture = <br />
[[File:MbPA_model_architecture.PNG|700px|thumb|center|Architecture for the MbPA model. Left: Training Usage. Right: Testing Setting.]]<br />
<br />
== Training Phase == <br />
<br />
The model consists of three components: an embedding network <math>f_{\gamma}</math>, a memory <math>M</math> and an output network <math>g_{\theta}</math>. The embedding network and the output network can be thought of as the standard feedforward neural networks for our purposes, with parameters (weights) <math>\gamma</math> and <math>\theta</math>, respectively. The memory, denoted by <math>M</math>, stores “experiences” in the form of key and value pairs <math>\{(h_{i},v_{i})\}</math> where the keys <math>h_{i}</math> are the outputs of the embedding network <math>f_{\gamma}(x_{i})</math> and the values <math>v_{i}</math>, in the context of classification, are simply the true class labels <math>y_{i}</math>. Thus, for a given input <math>x_{j}</math><br />
<br />
<center><br />
<math><br />
f_{\gamma}(x_{j}) \rightarrow h_{j},<br />
</math><br />
</center><br />
<br />
<center><br />
<math><br />
y_{j} \rightarrow v_{j}.<br />
</math><br />
</center> <br />
<br />
Note that the memory has a fixed size; thus when it is full, the oldest data is discarded first.<br />
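<br />
The fixed-size, oldest-first behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name <code>EpisodicMemory</code>, the method <code>write</code> and the toy keys are hypothetical, and the keys stand in for embeddings <math>h_{i}</math> that would normally come from <math>f_{\gamma}</math>.<br />
<br />
```python
from collections import deque


class EpisodicMemory:
    """Hypothetical fixed-size key/value store with oldest-first eviction."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once it is full.
        self.buffer = deque(maxlen=capacity)

    def write(self, keys, values):
        # Store each (key h_i, value v_i) pair; keys are embeddings, values are labels.
        for h, v in zip(keys, values):
            self.buffer.append((tuple(h), v))

    def __len__(self):
        return len(self.buffer)


# Toy usage: five (embedding, label) pairs written into a memory of size three.
memory = EpisodicMemory(capacity=3)
toy_keys = [(0.0, 0.1), (1.0, 1.1), (2.0, 2.1), (3.0, 3.1), (4.0, 4.1)]
toy_labels = [0, 1, 2, 3, 4]
memory.write(toy_keys, toy_labels)
stored_labels = [v for _, v in memory.buffer]
```
<br />
After the five writes only the three most recent pairs remain, which mirrors the rule that the oldest data is discarded first.<br />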
<br />
During training, the authors randomly sample a set of <math>b</math> training examples (i.e. a mini-batch of size <math>b</math>), say <math>\{(x_{b},y_{b})\}_{b}</math>, from the training data and pass them through the embedding network <math>f_{\gamma}</math>, followed by the output network <math>g_{\theta}</math>. The parameters of the embedding and output networks are updated by maximizing the likelihood function (equivalently, minimizing the loss function) of the target values<br />
<br />
<center><br />
<math><br />
p(y|x,\gamma,\theta)=g_{\theta}(f_{\gamma}(x)).<br />
</math><br />
</center><br />
<br />
The last layer of the output network <math>g_{\theta}</math> is a softmax layer, so the output can be interpreted as a probability distribution over classes. The parameter updates are computed by backpropagation with mini-batch gradient descent. Finally, the embedded samples <math>\{(f_{\gamma}(x_{b}),y_{b})\}_{b}</math> are stored in the memory. No local adaptation takes place during this phase.<br />
<br />
== Testing Phase == <br />
<br />
= Examples =<br />
<br />
== Continual Learning ==<br />
Continual learning is the process of learning multiple tasks in a sequence without revisiting a task. The authors consider a permuted MNIST setup, similar to [[#References|[3]]], where each task was given by a different permutation of the pixels. The authors sequentially trained the MbPA on 20 different permutations and tested on previously trained tasks.<br />
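<br />
The permuted-task construction described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names <code>make_permutations</code> and <code>apply_permutation</code> are hypothetical, the seed is arbitrary, and a toy vector of pixel indices stands in for a flattened 28 x 28 MNIST image.<br />
<br />
```python
import random

NUM_PIXELS = 28 * 28  # a flattened MNIST image has 784 pixels
NUM_TASKS = 20        # the summary reports 20 different permutations


def make_permutations(num_tasks, num_pixels, seed=0):
    """Draw one fixed random pixel permutation per task (hypothetical helper)."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(num_tasks):
        perm = list(range(num_pixels))
        rng.shuffle(perm)
        permutations.append(perm)
    return permutations


def apply_permutation(image, perm):
    """Reorder the pixels of a flattened image according to a task's permutation."""
    return [image[i] for i in perm]


permutations = make_permutations(NUM_TASKS, NUM_PIXELS)
toy_image = list(range(NUM_PIXELS))  # stand-in for real pixel intensities
task_views = [apply_permutation(toy_image, p) for p in permutations]
```
<br />
Every task sees the same pixel values in a different fixed order, so the tasks are equally hard but mutually interfering for a network trained on them in sequence.<br />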
<br />
The model was trained on 10 000 examples per task, using a two-layer multi-layer perceptron (MLP) with the Adam optimizer. Elastic weight consolidation (EWC) and regular gradient descent were used to estimate the parameters. A grid search was used to determine the EWC penalty cost; the local MbPA learning rate was chosen in the range <math>\beta\in(0.0,0.1)</math> and the number of adaptation steps in the range <math>n\in[1,20]</math>.<br />
<br />
[[File:ContinualLearning.PNG|700px|thumb|center|Results on baseline comparisons on permuted MNIST<br />
with MbPA using different memory sizes.]]<br />
<br />
The authors used the pixels as the embedding, i.e. <math>f_{\gamma}</math> is the identity function, and looked at regimes where the episodic memory was small. They found that with MbPA only a few gradient steps on carefully selected data from memory are enough to recover performance. MbPA outperformed the MLP and worked better than EWC in most cases, and its performance grew with the number of examples stored. The authors note that the memory requirements were lower than those of EWC. The lower memory requirements are attributed to the fact that EWC stores all task identifiers, whereas MbPA only stores a few examples. The figure above also shows the results of MbPA combined with other methods. MbPA combined with EWC gives the best results.<br />
<br />
== Incremental Learning ==<br />
<br />
= Conclusion =<br />
<br />
The MbPA model can successfully overcome several shortcomings associated with neural networks through its non-parametric, episodic memory. Many other works, in classification and language modelling among other areas, have successfully used variants of this architecture, in which traditional neural network systems are augmented with memories. Likewise, the experiments in incremental and continual learning presented in this paper use a memory architecture similar to the Differentiable Neural Dictionary (DND) used in Neural Episodic Control (NEC)<sup>[[#References|[6]]]</sup>, though the gradients from the memory in the MbPA model are not used during training. In conclusion, MbPA presents a natural way to improve the performance of standard deep networks.<br />
<br />
=References=<br />
* <sup>[[1]]</sup> Sprechmann. Pablo, Jayakumar. Siddhant, Rae. Jack, Pritzel. Alexander, Badia. Adria, Uria. Benigno, Vinyals. Oriol, Hassabis. Demis, Pascanu. Razvan, and Blundell. Charles. Memory-based parameter adaptation. ICLR, 2018.<br />
<br />
* <sup>[[2]]</sup> Kumaran. Dhushan, Hassabis. Demis, and McClelland. James. What learning systems do intelligent agents need? Trends in Cognitive Sciences, 2016.<br />
<br />
* <sup>[[3]]</sup> Goodfellow. Ian, Warde-Farley. David, Mirza. Mehdi, Courville. Aaron, and Bengio. Yoshua. Maxout networks. arXiv preprint, 2013.<br />
<br />
* <sup>[[4]]</sup> Russakovsky. Olga, Deng. Jia, Su. Hao, Krause. Jonathan, Satheesh. Sanjeev, Ma. Sean, Huang. Zhiheng, Karpathy. Andrej, Khosla. Aditya, and Bernstein. Michael. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.<br />
<br />
* <sup>[[5]]</sup> He. Kaiming, Zhang. Xiangyu, Ren. Shaoqing, and Sun. Jian. Deep residual learning for image recognition. IEEE conference on computer vision and pattern recognition, 2016.<br />
<br />
* <sup>[[6]]</sup> Pritzel. Alexander, Uria. Benigno, Srinivasan. Sriram, Puigdomenech. Adria, Vinyals. Oriol, Hassabis. Demis, Wierstra. Daan, and Blundell. Charles. Neural episodic control. ICML, 2017.</div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory-Based_Parameter_Adaptation&diff=38581Memory-Based Parameter Adaptation2018-11-09T23:52:23Z<p>As2na: /* Continual Learning */</p>
<hr />
<div>This is a summary based on the paper, Memory-based Parameter Adaptation by Sprechmann et al.<sup>[[#References|[1]]]</sup> <br />
<br />
The paper generalizes approaches from language modelling that use memory-based adaptation to overcome some shortcomings of neural networks, including catastrophic forgetting. Catastrophic forgetting occurs when a neural network performs poorly on old tasks after it has been trained to perform well on a new task. The paper also presents experimental results in which the model is applied to continual and incremental learning tasks.<br />
<br />
= Presented by = <br />
*J.Walton<br />
*J.Schneider<br />
*Z.Abbas<br />
*A.Na<br />
<br />
= Introduction = <br />
<br />
Memory-based parameter adaptation (MbPA) is based on the theory of complementary learning systems, which states that intelligent agents must possess two learning systems: one that allows the gradual acquisition of knowledge and another that allows rapid learning of the specifics of individual experiences<sup>[[#References|[2]]]</sup>. Similarly, MbPA consists of two components: a parametric component and a non-parametric component. The parametric component is the standard neural network, which learns slowly (low learning rates) but generalizes well. The non-parametric component, on the other hand, is a neural network augmented with an episodic memory that allows storing of previous experiences and local adaptation of the weights of the parametric component. The parametric and non-parametric components therefore serve different purposes during the training and testing phases.<br />
<br />
= Model Architecture = <br />
[[File:MbPA_model_architecture.PNG|700px|thumb|center|Architecture for the MbPA model. Left: Training Usage. Right: Testing Setting.]]<br />
<br />
== Training Phase == <br />
<br />
The model consists of three components: an embedding network <math>f_{\gamma}</math>, a memory <math>M</math> and an output network <math>g_{\theta}</math>. For our purposes, the embedding network and the output network can be thought of as standard feedforward neural networks with parameters (weights) <math>\gamma</math> and <math>\theta</math>, respectively. The memory <math>M</math> stores “experiences” as key-value pairs <math>\{(h_{i},v_{i})\}</math>, where the keys <math>h_{i}</math> are the outputs of the embedding network <math>f_{\gamma}(x_{i})</math> and the values <math>v_{i}</math>, in the context of classification, are simply the true class labels <math>y_{i}</math>. Thus, for a given input <math>x_{j}</math><br />
<br />
<center><br />
<math><br />
f_{\gamma}(x_{j}) \rightarrow h_{j},<br />
</math><br />
</center><br />
<br />
<center><br />
<math><br />
y_{j} \rightarrow v_{j}.<br />
</math><br />
</center> <br />
<br />
Note that the memory has a fixed size; thus when it is full, the oldest data is discarded first.<br />
<br />
During training, the authors randomly sample a set of <math>b</math> training examples (i.e. a mini-batch of size <math>b</math>), say <math>\{(x_{b},y_{b})\}_{b}</math>, from the training data and pass them through the embedding network <math>f_{\gamma}</math>, followed by the output network <math>g_{\theta}</math>. The parameters of the embedding and output networks are updated by maximizing the likelihood function (equivalently, minimizing the loss function) of the target values<br />
<br />
<center><br />
<math><br />
p(y|x,\gamma,\theta)=g_{\theta}(f_{\gamma}(x)).<br />
</math><br />
</center><br />
<br />
The last layer of the output network <math>g_{\theta}</math> is a softmax layer, so the output can be interpreted as a probability distribution over classes. The parameter updates are computed by backpropagation with mini-batch gradient descent. Finally, the embedded samples <math>\{(f_{\gamma}(x_{b}),y_{b})\}_{b}</math> are stored in the memory. No local adaptation takes place during this phase.<br />
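<br />
As a worked form of the training objective, and assuming the loss is the average negative log-likelihood over the mini-batch (the summary does not state the exact normalization, and the index <math>j</math> is introduced here only to avoid reusing <math>b</math> as both batch size and index), maximizing the likelihood corresponds to minimizing<br />
<br />
<center>
<math>
\mathcal{L}(\gamma,\theta) = -\frac{1}{b}\sum_{j=1}^{b} \log p(y_{j}|x_{j},\gamma,\theta) = -\frac{1}{b}\sum_{j=1}^{b} \log \left[ g_{\theta}(f_{\gamma}(x_{j})) \right]_{y_{j}},
</math>
</center>
<br />
where <math>[\cdot]_{y_{j}}</math> denotes the softmax output for the true class <math>y_{j}</math>. With a softmax output this is the usual cross-entropy loss.<br />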
<br />
== Testing Phase == <br />
<br />
= Examples =<br />
<br />
== Continual Learning ==<br />
Continual learning is the process of learning multiple tasks in a sequence without revisiting a task. The authors consider a permuted MNIST setup, similar to [[#References|[3]]], where each task was given by a different permutation of the pixels. The authors sequentially trained the MbPA on 20 different permutations and tested on previously trained tasks.<br />
<br />
The model was trained on 10 000 examples per task, using a two-layer multi-layer perceptron (MLP) with the Adam optimizer. Elastic weight consolidation (EWC) and regular gradient descent were used to estimate the parameters. A grid search was used to determine the EWC penalty cost; the local MbPA learning rate was chosen in the range <math>\beta\in(0.0,0.1)</math> and the number of adaptation steps in the range <math>n\in[1,20]</math>.<br />
<br />
[[File:ContinualLearning.PNG|700px|thumb|center|Results on baseline comparisons on permuted MNIST<br />
with MbPA using different memory sizes.]]<br />
<br />
The authors used the pixels as the embedding, i.e. <math>f_{\gamma}</math> is the identity function, and looked at regimes where the episodic memory was small. They found that with MbPA only a few gradient steps on carefully selected data from memory are enough to recover performance. MbPA outperformed the MLP and worked better than EWC in most cases, and its performance grew with the number of examples stored. The authors note that the memory requirements were lower than those of EWC. The lower memory requirements are attributed to the fact that EWC stores all task identifiers, whereas MbPA only stores a few examples. The figure above also shows the results of MbPA combined with other methods. MbPA combined with EWC gives the best results.<br />
<br />
== Incremental Learning ==<br />
<br />
= Conclusion =<br />
<br />
The MbPA model can successfully overcome several shortcomings associated with neural networks through its non-parametric, episodic memory. Many other works, in classification and language modelling among other areas, have successfully used variants of this architecture, in which traditional neural network systems are augmented with memories. Likewise, the experiments in incremental and continual learning presented in this paper use a memory architecture similar to the Differentiable Neural Dictionary (DND) used in Neural Episodic Control (NEC)<sup>[[#References|[6]]]</sup>, though the gradients from the memory in the MbPA model are not used during training. In conclusion, MbPA presents a natural way to improve the performance of standard deep networks.<br />
<br />
=References=<br />
* <sup>[[1]]</sup> Sprechmann. Pablo, Jayakumar. Siddhant, Rae. Jack, Pritzel. Alexander, Badia. Adria, Uria. Benigno, Vinyals. Oriol, Hassabis. Demis, Pascanu. Razvan, and Blundell. Charles. Memory-based parameter adaptation. ICLR, 2018.<br />
<br />
* <sup>[[2]]</sup> Kumaran. Dhushan, Hassabis. Demis, and McClelland. James. What learning systems do intelligent agents need? Trends in Cognitive Sciences, 2016.<br />
<br />
* <sup>[[3]]</sup> Goodfellow. Ian, Warde-Farley. David, Mirza. Mehdi, Courville. Aaron, and Bengio. Yoshua. Maxout networks. arXiv preprint, 2013.<br />
<br />
* <sup>[[4]]</sup> Russakovsky. Olga, Deng. Jia, Su. Hao, Krause. Jonathan, Satheesh. Sanjeev, Ma. Sean, Huang. Zhiheng, Karpathy. Andrej, Khosla. Aditya, and Bernstein. Michael. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.<br />
<br />
* <sup>[[5]]</sup> He. Kaiming, Zhang. Xiangyu, Ren. Shaoqing, and Sun. Jian. Deep residual learning for image recognition. IEEE conference on computer vision and pattern recognition, 2016.<br />
<br />
* <sup>[[6]]</sup> Pritzel. Alexander, Uria. Benigno, Srinivasan. Sriram, Puigdomenech. Adria, Vinyals. Oriol, Hassabis. Demis, Wierstra. Daan, and Blundell. Charles. Neural episodic control. ICML, 2017.</div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:ContinualLearning.PNG&diff=38580File:ContinualLearning.PNG2018-11-09T23:52:10Z<p>As2na: </p>
<hr />
<div></div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18&diff=37908stat441F182018-11-06T00:53:24Z<p>As2na: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F18-STAT841-Proposal| Project Proposal ]] ==<br />
<br />
[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Nov 13 || Jason Schneider, Jordyn Walton, Zahraa Abbas, Andrew Na || 1|| Memory-Based Parameter Adaptation || [https://arxiv.org/pdf/1802.10542.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/images/0/0f/MbPA_Summary.pdf Summary] ||<br />
|-<br />
|Nov 13 ||Sai Praneeth M, Xudong Peng, Alice Li, Shahrzad Hosseini Vajargah|| 2|| Going deeper with convolutions ||[https://arxiv.org/pdf/1409.4842.pdf paper] || <br />
|-<br />
|Nov 15 || Yan Yu Chen, Qisi Deng, Hengxin Li, Bochao Zhang|| 3|| Topic Compositional Neural Language Model|| [https://arxiv.org/pdf/1712.09783.pdf paper] || <br />
|-<br />
|Nov 15 || Zhaoran Hou, Pei Wei Wang, Chi Zhang, Yiming Li, Daoyi Chen, Ying Chi|| 4|| Extreme Learning Machine for regression and Multi-class Classification|| [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6035797] || ||<br />
|-<br />
|Nov 20 || Kristi Brewster, Isaac McLellan, Ahmad Nayar Hassan, Marina Medhat Rassmi Melek, Brendan Ross, Jon Barenboim, Junqiao Lin, James Bootsma || 5|| A Neural Representation of Sketch Drawings || || <br />
|-<br />
|Nov 20 || Maya(Mahdiyeh) Bayati, Saber Malekmohammadi, Vincent Loung || 6|| Convolutional Neural Networks for Sentence Classification || [https://arxiv.org/pdf/1408.5882.pdf paper] || <br />
|-<br />
|Nov 22 || Qingxi Huo, Yanmin Yang, Jiaqi Wang, Yuanjing Cai, Colin Stranc, Philomène Bobichon, Aditya Maheshwari, Zepeng An || 7|| Robust Probabilistic Modeling with Bayesian Data Reweighting || [http://proceedings.mlr.press/v70/wang17g/wang17g.pdf Paper] || <br />
|-<br />
|Nov 22 || Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su, Jiacheng Weng, Keqi Li, Yi Qian, Bomeng Liu || 8|| Deep Residual Learning for Image Recognition || [http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf Paper] || <br />
|-<br />
|Nov 27 || Mitchell Snaith || 9|| You Only Look Once: Unified, Real-Time Object Detection, V1 -> V3 || [https://arxiv.org/pdf/1506.02640.pdf Paper] || <br />
|-<br />
|Nov 27 || Qi Chu, Gloria Huang, Dylan Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu, Shikun Cui || 10|| TBA || || <br />
|-<br />
|Nov 29 || Jameson Ngo, Amy Xu, Aden Grant, Yu Hao Wang, Andrew McMurry, Baizhi Song || 11|| TBA || || <br />
|-<br />
|Nov 29 || Qianying Zhao, Hui Huang, Lingyun Yi, Jiayue Zhang, Siao Chen, Rongrong Su, Gezhou Zhang, Meiyu Zhou || 12|| || ||<br />
|-<br />
|Makeup || Hudson Ash, Stephen Kingston, Richard Zhang, Alexandre Xiao, Ziqiu Zhu || || || ||<br />
|-<br />
|Makeup || || || || ||<br />
|-<br />
|Makeup || || || || ||<br />
|-<br />
|Makeup || || || || ||</div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:As2na&diff=37906User:As2na2018-11-06T00:52:10Z<p>As2na: /* Introduction */</p>
<hr />
<div>= Introduction =<br />
<br />
The paper generalizes approaches from language modelling that use memory-based adaptation to overcome some shortcomings of neural networks, including catastrophic forgetting. Catastrophic forgetting occurs when a neural network performs poorly on old tasks after it has been trained to perform well on a new task. The paper also presents experimental results in which the model is applied to continual and incremental learning tasks.<br />
<br />
Memory-based parameter adaptation (MbPA) is based on the theory of complementary learning systems, which states that intelligent agents must possess two learning systems: one that allows the gradual acquisition of knowledge and another that allows rapid learning of the specifics of individual experiences (Kumaran, 2016). Similarly, MbPA consists of two components: a parametric component and a non-parametric component. The parametric component is the standard neural network, which learns slowly (low learning rates) but generalizes well. The non-parametric component, on the other hand, is a neural network augmented with an episodic memory that allows storing of previous experiences and local adaptation of the weights of the parametric component. The parametric and non-parametric components therefore serve different purposes during the training and testing phases.<br />
<br />
[[File:Figure1.PNG]]</div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Figure1.PNG&diff=37905File:Figure1.PNG2018-11-06T00:51:42Z<p>As2na: As2na uploaded a new version of File:Figure1.PNG</p>
<hr />
<div></div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Figure4.PNG&diff=37903File:Figure4.PNG2018-11-06T00:50:41Z<p>As2na: </p>
<hr />
<div></div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Figure3.PNG&diff=37902File:Figure3.PNG2018-11-06T00:50:31Z<p>As2na: </p>
<hr />
<div></div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Figure2.PNG&diff=37901File:Figure2.PNG2018-11-06T00:50:19Z<p>As2na: </p>
<hr />
<div></div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Figure1.PNG&diff=37900File:Figure1.PNG2018-11-06T00:50:07Z<p>As2na: </p>
<hr />
<div></div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:As2na&diff=37899User:As2na2018-11-06T00:49:47Z<p>As2na: /* Introduction */</p>
<hr />
<div>= Introduction =<br />
<br />
The paper generalizes approaches from language modelling that use memory-based adaptation to overcome some shortcomings of neural networks, including catastrophic forgetting. Catastrophic forgetting occurs when a neural network performs poorly on old tasks after it has been trained to perform well on a new task. The paper also presents experimental results in which the model is applied to continual and incremental learning tasks.<br />
<br />
Memory-based parameter adaptation (MbPA) is based on the theory of complementary learning systems, which states that intelligent agents must possess two learning systems: one that allows the gradual acquisition of knowledge and another that allows rapid learning of the specifics of individual experiences (Kumaran, 2016). Similarly, MbPA consists of two components: a parametric component and a non-parametric component. The parametric component is the standard neural network, which learns slowly (low learning rates) but generalizes well. The non-parametric component, on the other hand, is a neural network augmented with an episodic memory that allows storing of previous experiences and local adaptation of the weights of the parametric component. The parametric and non-parametric components therefore serve different purposes during the training and testing phases.<br />
<br />
[[File:Figure1.png]]</div>As2nahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18&diff=37843stat441F182018-11-05T22:18:52Z<p>As2na: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F18-STAT841-Proposal| Project Proposal ]] ==<br />
<br />
[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Nov 13 || Jason Schneider, Jordyn Walton, Zahraa Abbas, Andrew Na || 1|| Memory-Based Parameter Adaptation || [https://arxiv.org/pdf/1802.10542.pdf Paper] || <br />
|-<br />
|Nov 13 ||Sai Praneeth M, Xudong Peng, Alice Li, Shahrzad Hosseini Vajargah|| 2|| Going deeper with convolutions ||[https://arxiv.org/pdf/1409.4842.pdf paper] || <br />
|-<br />
|Nov 15 || Yan Yu Chen, Qisi Deng, Hengxin Li, Bochao Zhang|| 3|| Topic Compositional Neural Language Model|| [https://arxiv.org/pdf/1712.09783.pdf paper] || <br />
|-<br />
|Nov 15 || Zhaoran Hou, Pei Wei Wang, Chi Zhang, Yiming Li, Daoyi Chen, Ying Chi|| 4|| Extreme Learning Machine for regression and Multi-class Classification|| [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6035797] || ||<br />
|-<br />
|Nov 20 || Kristi Brewster, Isaac McLellan, Ahmad Nayar Hassan, Marina Medhat Rassmi Melek, Brendan Ross, Jon Barenboim, Junqiao Lin, James Bootsma || 5|| A Neural Representation of Sketch Drawings || || <br />
|-<br />
|Nov 20 || Maya(Mahdiyeh) Bayati, Saber Malekmohammadi, Vincent Loung || 6|| Convolutional Neural Networks for Sentence Classification || [https://arxiv.org/pdf/1408.5882.pdf paper] || <br />
|-<br />
|Nov 22 || Qingxi Huo, Yanmin Yang, Jiaqi Wang, Yuanjing Cai, Colin Stranc, Philomène Bobichon, Aditya Maheshwari, Zepeng An || 7|| Robust Probabilistic Modeling with Bayesian Data Reweighting || [http://proceedings.mlr.press/v70/wang17g/wang17g.pdf Paper] || <br />
|-<br />
|Nov 22 || Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su, Jiacheng Weng, Keqi Li, Yi Qian, Bomeng Liu || 8|| Deep Residual Learning for Image Recognition || [http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf Paper] || <br />
|-<br />
|Nov 27 || Mitchell Snaith || 9|| You Only Look Once: Unified, Real-Time Object Detection, V1 -> V3 || [https://arxiv.org/pdf/1506.02640.pdf Paper] || <br />
|-<br />
|Nov 27 || Qi Chu, Gloria Huang, Dylan Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu, Shikun Cui || 10|| TBA || || <br />
|-<br />
|Nov 29 || Jameson Ngo, Amy Xu, Aden Grant, Yu Hao Wang, Andrew McMurry, Baizhi Song || 11|| TBA || || <br />
|-<br />
|Nov 29 || Qianying Zhao, Hui Huang, Lingyun Yi, Jiayue Zhang, Siao Chen, Rongrong Su, Gezhou Zhang, Meiyu Zhou || 12|| || ||<br />
|-<br />
|Makeup || || || || ||<br />
|-<br />
|Makeup || || || || ||<br />
|-<br />
|Makeup || || || || ||<br />
|-<br />
|Makeup || || || || ||</div>As2na