# Gradient Episodic Memory for Continual Learning

## Presented by

• Yu Xuan Lee
• Tsen Yee Heng

## Background and Introduction

Supervised learning consist of a training set $D_{tx}=(x_i,y_i)^n_{i=1}$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. Empirical Risk Minimization (ERM) is one of the common supervised learning method used to minimize a loss function by having multiple passes over the training set.

$\frac{1}{|D_{tr}|}\textstyle \sum_{(x_i,y_i) \in D_{tr}} \ell (f(x_i),y_i)$

where $\ell :\mathcal {Y} \times \mathcal {Y} \to [0, \infty)$

Different to machine learning, datas are being observed sequentially, occurred recurrently, and stored limitedly for learning humans. Thus, the iid assumption is not applicable. One of the characteristics of ERM is "catastrophic forgetting", which is the problem of recalling past knowledge upon acquiring new ones. To overcome this problem, Gradient Episodic Memory (GEM) is introduced to alleviates forgetting on previous acquired knowledge, while solving new problems more efficiently.

## Framework for Continual Learning

The feature vector $x_i \in \mathcal{X}_t$, task descriptor $t_i \in \mathcal{T}$, and target vector $y_i \in \mathcal{Y}_t$ are the three main components of a continuum of data. Note that the continuum is locally iid where for every $(x_i, t_i, y_i)$

$(x_i,y_i) \overset{iid}{\sim} P_{t_i}(X,Y)$

The main mathematical purpose of continual learning is to obtain $f: \mathcal{X} \times \mathcal{Y}$ where a target vector $y$ must be inquired using a test pair $(x,t)$.

Task descriptors are structured objects, describing how to solve each $i$-th task. They are integers $t_i=i \in \mathbb{Z}$ which occurs in a collection where $t_1,...,t_n \in \mathcal{T}$. Most importantly, they could distinguish every same input $x_i$ that have different target. To conclude, task descriptors plays the part of carrying crucial information of the example and distinguishing different learning environment for similar examples.

### Training Protocol

The target setting for continual learning are as follow:

• Small quantity of training examples for each task
• Examples for each tasks being observed only once
• Outcome of transfer and forgetting being concluded

To perform this, each example were only given once to the learner in one at a time in sequence. In this case, learner gets information in $(x_i,t_i,y_i)$ form with no duplication.

### Evaluation Metrics

The capability of transferring knowledge across tasks are very important in addition to results across each tasks. First of all, transferring knowledge are categorized as follow:

• Backward transfer (BWT) This is the difference of judgement towards previously encountered task $k$ after learning new task $t$, noted as $k \prec t$. Within backward transfer, there are two categories, positive backward transfer and negative backward transfer. Positive backward transfer shows a better judgement towards previously encountered task $k$ after learning new task $t$. Contrarily, negative backward transfer shows the opposite. Also, do note that catastrophic forgetting happens due to extensive negative backward transfer.
• Forward transfer (FWT) Opposite to BWT, FWT shows judgement towards new task $t$ after learning task $k$, noted as $k \succ t$. Positive forward transfer is one way of forward transfer.

Given a test set of $T$, we would learn task $t_i$ and observe its performance towards all $T$ tasks. A matrix $R_{i,j}$ as test classification accuracy of the model on task $t_j$ after observing the last sample from task $t_j$ is constructed, where $R \in \mathbb{R} ^{T \times T}$. Note that $\bar b\$ is the vector of test accuracies for each task at random initialization. The function for Average Accuracy (ACC), Backward Transfer (BWT) and Forward Transfer (FWT) are shown below:

$ACC = \frac{1}{T} \sum_{i=1}^T R_{T,i}$

$BWT = \frac{1}{T-1} \sum_{i=1}^{T-1} R_{T,i} - R_{i,i}$

$FWT = \frac{1}{T-1} \sum_{i=2}^{T} R_{i-1,i- \bar b\ _i}$

Note that if ACC happens to be similar for both models, model with higher BWT and FWT values are more desired.

Episodic memory $M_t$ is very important in GEM, it contains information on examples on task $t$ and it is indicated from the integer task descriptors. So practically, we would minimize catastrophic forgetting by using the episodic memory efficiently. Note that learner is assumed to have limited memory locations $M$. Hence, the amount located for each task is calculated as $m=\frac{M}{T}$. To calculate the loss of memories from the $k$-th task, assuming predictors $f_ \theta$ parameterized by $\theta \in \mathbb{R} ^p$, we have the following equation:

$\ell (f_\theta, \mathcal{M}_k)=\frac{1}{|\mathcal{M}_k|} \sum_{(x_i,k,y_i) \in \mathcal{M}_k} \ell(f_ \theta (x_i,k),y_i)$

The above equation will be treated as inequality constraint and a decrease in the equation would be in favour instead of increase. So we would use $(x,t,y)$ to minimize the following equation:

$mimimize_\theta \space \space \ell(f_\theta(x,t),y)$
$subject\space to \space \space \ell (f_\theta,\mathcal{M}_k) \le \ell(f_\theta^{t-1},\mathcal{M}_k) \space\space for \space all \space k\lt t$

where $f_\theta^{t-1}$ is the predictor state at the end of learning of task $t-1$.

To efficiently solve the above equation, three ideas are proposed:

• Unnecessary to store old predictors $f_\theta^{t-1}$.
• Functions are locally linear.
• Loss of previous tasks could be calculated using the angle between loss gradient vector and proposed update.

With the above ideas, the loss function is further improved as follow:

$\langle g,g_k \rangle := \langle \frac{\partial \ell(f_\theta(x,t),y)}{\partial \theta}, \frac{\partial \ell(f_\theta,\mathcal{M}_k)}{\partial \theta} \rangle \ge 0, \space for \space all \space k\lt t.$

However, if there is at least one violation in the equality constraint, we would overcome this by projecting the gradient $g$ to the closest gradient $\tilde{g}$ satisfying all the constraints. The optimization problem becomes

$minimize_{ \tilde{g} } \space \space \frac{1}{2}\parallel g - \tilde{g} \parallel _2^2$
$subject \space to \space \space \langle \tilde{g},g_k \rangle \ge 0 \space \space for \space all \space k\lt t$

Therefore, the primal GEM Quadratic Program (QP) is

$minimize_z \space \space \frac{1}{2}z^Tz - g^Tz+\frac{1}{2}g^Tg$
$subject \space to \space \space Gz \ge 0,$

Dual of the GEM QP is

$minimize_v \space \space \frac{1}{2}v^TGG^Tv + g^TG^Tv$
$subject \space to \space \space v\ge 0$

By solving $v^*$, we could obtain the projected gradient update $\tilde{g}=G^Tv^* + g$. The algorithm is as follow:

## Experiment

We perform a variety of experiments to assess the performance of GEM in continual learning.

### Dataset

We consider the following dataset:

• MNIST Permutations. A group of datas consist of handwritten digits. It is transformed by pixels.
• MNIST Rotation A group of datas consist of handwritten digits rotated by a fixed angle between 0 and 180 degree.
• CIFAR100 A group of CIFAR dataset with 100 classes, which each task introduces a new set of classes. For a total of T tasks, each new task concerns examples from a disjoint subset of 100/T classes. Input distribution are similar but output distribution required different.

For all the datasets, we consider T = 20 tasks. On the MNIST dataset, there are 1000 samples for each task from 10 different classes and 2500 examples from 5 different classes for the CIFAR100 datasets. The tasks are observed in sequence and the example are going through by once only.

### Architectures

A neural network with two hidden layers of 100 ReLU units is used on this model. We use a smaller version of ResNet18 in the CIFAR100, with three times less feature maps across all layers. This is one simple way to leverage the task descriptor, in order to adapt the output distribution to the subset of classes for each task. Plain SGD is being used on mini batches of 10 samples in order to train the network. All hyper-parameter are optimized through grid-search, and the best result for each model are being listed.

### Methods

We compare GEM to 5 alternatives:

• A single predictor trained across all tasks.
• One independent predictor is being used for each task. Each of them has the same architecture as "single" but with T times less hidden units than "single".
• A multimodal predictor with a dedicated input layer per task (only for MNIST).
• EWC, where the loss is regularized to avoid catastrophic forgetting.
• iCARL, a class-incremented learner that classifies using a nearest exemplar algorithm, and using an episodic memory to prevent catastrophic forgetting. iCARL method only applied to our experiment on CIFAR100.

Notes : GEM, EWC, iCARL has the same architecture as "single", with episodic memory.

### Results

Figure 1 (left) shows the average accuracy, BWT and FWT. Overall, GEM performs similarly or better than multimodal model. GEM minimized backward transfer, while exhibiting negligible or positive forward transfer.

Figure 1(right) shows the evolution of the test accuracy of the first task throughout the continuum of data. GEM exhibits minimal forgetting, and achieve positive BWT in CIFAR100.

Overall, GEM performs slightly better than other continuing method like EWC, while spending less computational. GEM efficiency comes from optimizing over a number of variables equal to the number of tasks, instead of optimizing over a number of variables equal to the number of parameters.

Table 2 shows the final ACC for both GEM and iCARL in the dataset CIFAR100, as a function of their episodic memory size. As we can see from table 2, GEM outperform iCARL for a wide range of memory sizes.

Table 3 shows that the importance of memory as we do more than one pass through the data on the MNIST Rotation experiment. As we can see from table 3, GEM and EWC (memory-based) has a higher ACC as the number of passes through the data increases, and memory-less method causes higher negative BWT, which has lower accuracy. By comparing the first and last row of table 3, GEM matches the condition of “oracle performance upper-bound” ACC in iid learning, and minimizes negative BWT, which has higher ACC.

## Related Work

Continual learning is a method that keep the knowledge from the past task and apply them to acquire new skills through all the task. In this work, we use continuing learning to focus more on the realistic setting where examples are only seen once. We introduced the GEM, which outperforms every model in limiting forgetting.

There are some ways to avoid catastrophic forgetting in which one of those ways is to freeze the early layer in the neural network and duplicate the later layers on new tasks. However, it is hard to use these methods and there are too many modules and tasks. We want to use a single model and modify the learning objective to avoid catastrophic forgetting. There is some other method that use synaptic memory, which minimized the important parameters for the previous task and there is another method which use episodic memory stored and replayed the examples from previous tasks to prevent forgetting and allow positive backward transfer.

## Conclusion

The scenario of continuing learning is being formalized. Firstly, the training and evaluation protocol are being defined to assess the quality of models in term of their accuracy, and the ability to transfer knowledge with BWT and FWT. Secondly, GEM is introduced as a model which leverage the episodic memory to get a positive BWT and avoid forgetting efficiently. We demonstrate the effectiveness of GEM by comparing GEM to other model such as EWC, multimodal, etc.

However, there is some improvement has to be done to GEM. Firstly, GEM might obtain positive FWT because it do not leverage structured task descriptor. Secondly, there are too many complicated structures in advance memory management which we did not investigate. Thirdly, the iteration of GEM increase the computational time.

## Reference

• [1] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient Episodic Memory for Continual Learning.