# Functional regularisation for continual learning with Gaussian processes

Meixi Chen

## Introduction

Continual Learning (CL) refers to the problem where different tasks are fed to a model sequentially, such as training a natural language processing model on different languages over time. A major challenge in CL is that a model forgets how to solve earlier tasks. This paper proposes a new framework to regularize CL so that the model does not forget previously learnt tasks. The method, referred to as functional regularization for Continual Learning (FRCL), leverages Gaussian processes to construct an approximate posterior belief over the underlying task-specific function. The posterior belief is then used as a regularizer in the optimization to prevent the model from deviating completely from the earlier tasks. The estimation of the posterior functions is carried out under the framework of approximate Bayesian inference.

## Previous Work

There are two types of methods that have been widely used in Continual Learning.

### Replay/Rehearsal Methods

Methods of this type store the data from earlier tasks, or a compressed form of it. The stored data is replayed when learning a new task to mitigate forgetting. It can be used for constraining the optimization of new tasks or for joint training of both previous and current tasks. However, this approach has two disadvantages: 1) Deciding which data to store often remains heuristic; 2) It requires a large quantity of stored data to achieve good performance.

### Regularization-based Methods

These methods leverage sequential Bayesian inference by putting a prior distribution over the model parameters in the hope of regularizing the learning of new tasks. Two important methods are Elastic Weight Consolidation (EWC) and Variational Continual Learning (VCL), both of which make model parameters adaptive to new tasks while regularizing weights by prior knowledge from the earlier tasks. Nonetheless, with long sequences of tasks this might still result in increased forgetting of earlier tasks.

## Comparison between the Proposed Method and Previous Methods

### Comparison to replay/rehearsal methods

Similarity: It also stores data from earlier tasks.

Difference: Instead of storing a subset of data, it stores a set of inducing points, which can be optimized using criteria from GP literature [2] [3] [4].

### Comparison to regularization-based methods

Similarity: It is also based on approximate Bayesian inference by using a prior distribution that regularizes the model updates.

Difference: It constrains the neural network on the space of functions rather than weights by making use of Gaussian processes (GP).

## Recap of the Gaussian Process

Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [1].

The Gaussian process is a non-parametric approach as it can be viewed as an infinite-dimensional generalization of multivariate normal distributions. In a very informal sense, it can be thought of as a distribution over continuous functions - this is why we make use of GPs to perform optimization in the function space. A Gaussian process over a prediction function $f(\boldsymbol{x})$ can be completely specified by its mean function and covariance function (or kernel function), $\text{Gaussian process: } f(\boldsymbol{x}) \sim \mathcal{GP}(m(\boldsymbol{x}),K(\boldsymbol{x},\boldsymbol{x}')).$ Note that in practice the mean function is typically taken to be 0 because we can always write $f(\boldsymbol{x})=m(\boldsymbol{x}) + g(\boldsymbol{x})$ where $g(\boldsymbol{x})$ follows a GP with 0 mean. Hence, the GP is characterized by its kernel function.

In fact, we can connect a GP to a multivariate normal (MVN) distribution with 0 mean, which is given by $\text{Multivariate normal distribution: } \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}).$ When we only observe finitely many $\boldsymbol{x}$, the function's value at these input points is a multivariate normal distribution with covariance matrix parametrized by the kernel function.

Note: Throughout this summary, $\mathcal{GP}$ refers to the distribution of functions, and $\mathcal{N}$ refers to the distribution of finite random variables.
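To make the link between $\mathcal{GP}$ and $\mathcal{N}$ concrete, here is a minimal sketch that draws function values from a zero-mean GP prior at finitely many inputs. The squared-exponential kernel, its hyperparameters, and the jitter term are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel K(x, x') for 1-D inputs (an illustrative choice)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)            # finitely many input points
K = rbf_kernel(x, x)                      # covariance matrix built from the kernel
K_jitter = K + 1e-8 * np.eye(len(x))      # small diagonal term for numerical stability

# The function values at x are jointly Gaussian: f(x) ~ N(0, K).
samples = rng.multivariate_normal(np.zeros(len(x)), K_jitter, size=3)
print(samples.shape)  # (3, 50)
```

Each row of `samples` is one draw of the function evaluated at the 50 inputs; changing the kernel changes what kind of functions are likely under the prior.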

### A One-dimensional Example of the Gaussian Process

In the figure below, the red dashed line represents the underlying true function $f(x)$ and the red dots are the observations taken from this function. The blue solid line indicates the predicted function $\hat{f}(x)$ given the observations, and the blue shaded area corresponds to the uncertainty of the prediction.
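The predicted function and its uncertainty in such a one-dimensional example can be computed with the standard GP regression formulas. The sketch below assumes a squared-exponential kernel, a small observation noise `noise_var`, and $\sin(x)$ as a stand-in for the true function; none of these are taken from the paper.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel for 1-D inputs (an illustrative choice)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-4):
    """Posterior mean and pointwise variance of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.diag(cov)

x_train = np.array([-2.0, -1.0, 0.5, 2.0])   # observed inputs (the "dots")
y_train = np.sin(x_train)                    # stand-in for the true function f(x)
x_test = np.linspace(-3.0, 3.0, 61)
mean, var = gp_posterior(x_train, y_train, x_test)
# mean plays the role of the predicted function; mean +/- 2*sqrt(var) gives an uncertainty band.
```

The variance shrinks near the observations and grows back towards the prior variance away from them, which is the behaviour the shaded area depicts.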

## Methods

Consider a deep neural network in which the final hidden layer provides the feature vector $\phi(x;\theta)\in \mathbb{R}^K$, where $x$ is the input data and $\theta$ are the task-shared model parameters. Importantly, let's assume the task boundaries are known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to the English, French, or German dataset. This assumption is important because it allows us to know when to update the task-shared parameter $\theta$. The authors also discussed how to detect task boundaries when they are not given, which will be presented later in this summary.

For each specific task $i$, an output layer is constructed as $f_i(x;w_i) = w_i^T\phi(x;\theta)$, where $w_i$ is the task-specific weight. By assuming that the weight $w_i$ follows a normal distribution $w_i\sim \mathcal{N}(0, \sigma_w^2I)$, we obtain a distribution over functions: $f_i(x) \sim \mathcal{GP}(0, k(x,x')),$ where $k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)$. We can express our posterior belief over $f_i(x)$ instead of $w_i$. Namely, we are interested in estimating
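The equivalence between the weight-space prior and the kernel $k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)$ can be checked numerically. In the sketch below the feature matrix `Phi` is random and stands in for the network features $\phi(x_j;\theta)$; it is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_w = 0.5
Phi = rng.normal(size=(5, 3))     # hypothetical features phi(x_j; theta): N=5 inputs, K=3

# Function-space view: kernel matrix k(x, x') = sigma_w^2 phi(x)^T phi(x').
K = sigma_w**2 * Phi @ Phi.T

# Weight-space view: f = Phi w with w ~ N(0, sigma_w^2 I); estimate Cov[f] by sampling.
W = sigma_w * rng.normal(size=(3, 200_000))
F = Phi @ W                       # each column is one sampled function at the 5 inputs
K_mc = F @ F.T / W.shape[1]

print(np.max(np.abs(K - K_mc)))   # small, and shrinking as more samples are drawn
```

Note that this kernel matrix has rank at most $K$, the feature dimension, which is one reason a small set of inducing points can summarize a task.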

$\boldsymbol{f}_i|\text{Data} \sim p(\boldsymbol{f}_i|\boldsymbol{y}_i, X_i),$ where $X_i = \{x_{i,j}\}_{j=1}^{N_i}$ are input vectors and $\boldsymbol{y}_i = \{y_{i,j}\}_{j=1}^{N_i}$ are output targets so that each $y_{i,j}$ is assigned to the input $x_{i,j} \in \mathbb{R}^D$. However, in practice the following approximation is used:

$\boldsymbol{f}_i|\text{Data} \overset{approx.}{\sim} \mathcal{N}(\boldsymbol{f}_i|\mu_i, \Sigma_i).$ Instead of having a fixed model weight $w_i$, we now have a distribution for it, which depends on the input data. We can then summarize the information acquired from a task by the estimated distribution of the weights, or equivalently, the distribution of the output functions that depend on the weights. However, we face the computational challenge of storing $\mathcal{O}(N_i^2)$ parameters and keeping in memory the full set of input vectors $X_i$. To see this, note that $\Sigma_i$ is an $N_i \times N_i$ matrix. The authors tackle this problem by using a sparse Gaussian process with inducing points, which is introduced as follows.

Inducing Points: $Z_i = \{z_{i,j}\}_{j=1}^{M_i}$, which can be a subset of $X_i$ (the $i$-th training inputs) or points lying between the training inputs.

Auxiliary function: $\boldsymbol{u}_i$, where $u_{i,j} = f_i(z_{i,j})$.

We typically choose the number of inducing points to be a lot smaller than the number of training data. The idea behind the inducing point method is to replace $\boldsymbol{f}_i$ by the auxiliary function $\boldsymbol{u}_i$ evaluated at the inducing inputs $Z_i$. Intuitively, we are also assuming the inducing inputs $Z_i$ contain enough information to make inference about the "true" $\boldsymbol{f}_i$, so we can replace $X_i$ by $Z_i$.

Now we can introduce how to learn the first task when no previous knowledge has been acquired.

In learning the first task, the goal is to generate the first posterior belief given this task: $p(\boldsymbol{u}_1|\text{Data})$. We learn this distribution by approximating it by a variational distribution: $q(\boldsymbol{u}_1)$. That is, $p(\boldsymbol{u}_1|\text{Data}) \approx q(\boldsymbol{u}_1)$. We can parametrize $q(\boldsymbol{u}_1)$ as $\mathcal{N}(\boldsymbol{u}_1 | \mu_{u_1}, L_{u_1}L_{u_1}^T)$, where $L_{u_1}$ is the lower triangular Cholesky factor. I.e., $\Sigma_{u_1}=L_{u_1}L_{u_1}^T$. Next, we introduce how to estimate $q(\boldsymbol{u}_1)$, or equivalently, $\mu_{u_1}$ and $L_{u_1}$, using variational inference.

Given the first task with data $(X_1, \boldsymbol{y}_1)$, we can use a variational distribution $q(\boldsymbol{f}_1, \boldsymbol{u}_1)$ that approximates the exact posterior distribution $p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1)$, where $q(\boldsymbol{f}_1, \boldsymbol{u}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1)$ and $p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_\theta(\boldsymbol{u}_1|\boldsymbol{y}_1).$ Note that we use $p_\theta(\cdot)$ to denote the Gaussian distribution form with kernel parametrized by a common $\theta$.

Hence, we can jointly learn $q(\boldsymbol{u}_1)$ and $\theta$ by minimizing the KL divergence $\text{KL}(p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_{\theta}(\boldsymbol{u}_1|\boldsymbol{y}_1)),$ which is equivalent to maximizing the Evidence Lower Bound (ELBO) $\mathcal{F}({\theta}, q(\boldsymbol{u}_1)) = \sum_{j=1}^{N_1} \mathbb{E}_{q(f_{1,j})}[\log p(y_{1,j}|f_{1,j})]-\text{KL}(q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{u}_1)).$
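As an illustration of the ELBO, the sketch below evaluates it in closed form under the assumption of a Gaussian likelihood $p(y|f)=\mathcal{N}(y|f,\sigma^2)$, for which the expected log-likelihood is analytic. The Gaussian likelihood, the noise level `noise_var`, the jitter, and the random features are assumptions made here; the paper's experiments are largely classification tasks, where this expectation would need to be approximated.

```python
import numpy as np

def kl_gaussians(mu_q, L_q, K_p):
    """KL( N(mu_q, L_q L_q^T) || N(0, K_p) )."""
    M = len(mu_q)
    S_q = L_q @ L_q.T
    trace_term = np.trace(np.linalg.solve(K_p, S_q))
    maha_term = mu_q @ np.linalg.solve(K_p, mu_q)
    logdet_p = np.linalg.slogdet(K_p)[1]
    logdet_q = np.linalg.slogdet(S_q)[1]
    return 0.5 * (trace_term + maha_term - M + logdet_p - logdet_q)

def elbo(Phi_X, y, Phi_Z, mu_u, L_u, sigma_w=1.0, noise_var=0.1, jitter=1e-6):
    """ELBO for one task with kernel k(x, x') = sigma_w^2 phi(x)^T phi(x')."""
    K_Z = sigma_w**2 * Phi_Z @ Phi_Z.T + jitter * np.eye(len(Phi_Z))
    K_XZ = sigma_w**2 * Phi_X @ Phi_Z.T
    k_XX = sigma_w**2 * np.sum(Phi_X**2, axis=1)          # diagonal of K_X
    A = np.linalg.solve(K_Z, K_XZ.T).T                    # K_XZ K_Z^{-1}
    mean_f = A @ mu_u                                     # mean of q(f_j)
    var_f = k_XX + np.sum((A @ (L_u @ L_u.T - K_Z)) * A, axis=1)  # variance of q(f_j)
    exp_loglik = (-0.5 * np.log(2 * np.pi * noise_var)
                  - ((y - mean_f) ** 2 + var_f) / (2 * noise_var))
    return np.sum(exp_loglik) - kl_gaussians(mu_u, L_u, K_Z)

rng = np.random.default_rng(0)
N, M, K = 20, 3, 4
Phi_X = rng.normal(size=(N, K))                           # hypothetical features of X_1
y = Phi_X @ rng.normal(size=K) + np.sqrt(0.1) * rng.normal(size=N)
Phi_Z = Phi_X[:M]                                         # inducing points: subset of X_1
K_Z = Phi_Z @ Phi_Z.T + 1e-6 * np.eye(M)
mu_u = np.zeros(M)
L_u = 0.5 * np.linalg.cholesky(K_Z)
value = elbo(Phi_X, y, Phi_Z, mu_u, L_u)
print(value)
```

In training, `mu_u`, `L_u` and the network parameters behind the features would be updated by gradient ascent on this quantity; here it is only evaluated once.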

After learning the first task, we only keep the inducing points $Z_1$ and the parameters of $q(\boldsymbol{u}_1)$, both of which act as a task summary of the first task. Note that $\theta$ also has been updated based on the first task. In learning the $k$-th task, we can use the posterior belief $q(\boldsymbol{u}_1), q(\boldsymbol{u}_2), \ldots, q(\boldsymbol{u}_{k-1})$ obtained from the preceding tasks to regularize the learning, and produce a new task summary $(Z_k, q(\boldsymbol{u}_k))$. Similar to the first task, now the objective function to be maximized is $\mathcal{F}(\theta, q(\boldsymbol{u}_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_{k,j})}[\log p(y_{k,j}|f_{k,j})]- \text{KL}(q(\boldsymbol{u}_k) \ || \ p_{\theta}(\boldsymbol{u}_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularization from previous tasks}}$

As a result, we regularize the learning of a new task by the sum of KL divergence terms $\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))$, where each $q(\boldsymbol{u}_i)$ encodes the knowledge about an earlier task $i \lt k$. To make the optimization computationally efficient, we can sub-sample the KL terms in the sum and perform stochastic approximation over the regularization term.
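The regularization term and its sub-sampled version can be sketched as follows. The one-layer `features` map, the stored summaries, and the choice of sampling a single task uniformly and rescaling by the number of stored tasks are all assumptions for illustration; the paper only states that the KL terms can be sub-sampled.

```python
import numpy as np

def features(X, theta):
    """Hypothetical feature map phi(x; theta): one tanh layer with weights theta."""
    return np.tanh(X @ theta)

def kl_to_prior(mu_u, L_u, K_Z):
    """KL( N(mu_u, L_u L_u^T) || N(0, K_Z) )."""
    S_u = L_u @ L_u.T
    return 0.5 * (np.trace(np.linalg.solve(K_Z, S_u))
                  + mu_u @ np.linalg.solve(K_Z, mu_u) - len(mu_u)
                  + np.linalg.slogdet(K_Z)[1] - np.linalg.slogdet(S_u)[1])

def kl_term(summary, theta, sigma_w=1.0, jitter=1e-6):
    """KL(q(u_i) || p_theta(u_i)); the prior is recomputed with the CURRENT theta."""
    Z, mu_u, L_u = summary
    Phi_Z = features(Z, theta)
    K_Z = sigma_w**2 * Phi_Z @ Phi_Z.T + jitter * np.eye(len(Z))
    return kl_to_prior(mu_u, L_u, K_Z)

def full_regulariser(summaries, theta):
    return sum(kl_term(s, theta) for s in summaries)

def subsampled_regulariser(summaries, theta, rng):
    """Unbiased one-sample estimate: pick a stored task uniformly and rescale."""
    i = rng.integers(len(summaries))
    return len(summaries) * kl_term(summaries[i], theta)

rng = np.random.default_rng(0)
D, K, M = 2, 8, 4
theta = rng.normal(size=(D, K))
summaries = []                      # stored (Z_i, mu_{u_i}, L_{u_i}) for three earlier tasks
for _ in range(3):
    Z = rng.normal(size=(M, D))
    mu_u = rng.normal(size=M)
    L_u = 0.5 * np.eye(M) + 0.1 * np.tril(rng.normal(size=(M, M)))
    summaries.append((Z, mu_u, L_u))

full = full_regulariser(summaries, theta)
estimate = subsampled_regulariser(summaries, theta, rng)
print(full, estimate)
```

Because the stored $q(\boldsymbol{u}_i)$ are fixed while $p_\theta(\boldsymbol{u}_i)$ depends on the current $\theta$, these terms penalize changes to the shared features that would move the prior away from the stored task summaries.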

### Alternative Inference for the Current Task

Given this framework of sparse GP inference, the authors proposed a further improvement to obtain more accurate estimates of the posterior belief $q(\boldsymbol{u}_k)$: performing inference over the current task in the weight space. The number of inducing points imposes a trade-off between accuracy and scalability. To avoid this trade-off for the current task, we can use a full Gaussian variational approximation $q(w_k) = \mathcal{N}(w_k|\mu_{w_k}, \Sigma_{w_k})$ by letting $f_k(x; w_k) = w_k^T \phi(x; \theta)$, $w_k \sim \mathcal{N}(0, \sigma_w^2 I)$. The objective becomes $\mathcal{F}(\theta, q(w_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(w_k)}[\log p(y_{k,j}|w_k^T \phi(x_{k,j}; \theta))]- \text{KL}(q(w_k) \ || \ p(w_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularization from previous tasks}}$

After learning $\mu_{w_k}$ and $\Sigma_{w_k}$, we can also compute the posterior distribution over the function values $\boldsymbol{u}_k$ according to $q(\boldsymbol{u}_k) = \mathcal{N}(\boldsymbol{u}_k|\mu_{u_k}, L_{u_k}L_{u_k}^T)$, where $\mu_{u_k} = \Phi_{Z_k}\mu_{w_k}$, $L_{u_k}=\Phi_{Z_k}L_{w_k}$, $L_{w_k}$ is the Cholesky factor of $\Sigma_{w_k}$, and $\Phi_{Z_k}$ stores as rows the feature vectors evaluated at $Z_k$.
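The conversion from the weight-space posterior to the function-space summary is a linear map, which the following sketch verifies numerically. The feature matrix `Phi_Z` and the values of `mu_w` and `Sigma_w` are random placeholders, assumed here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K_feat, M = 4, 3
Phi_Z = rng.normal(size=(M, K_feat))        # rows: phi(z_{k,j}; theta), hypothetical values
mu_w = rng.normal(size=K_feat)              # learned mean of q(w_k), placeholder
A = rng.normal(size=(K_feat, K_feat))
Sigma_w = A @ A.T + 0.1 * np.eye(K_feat)    # learned covariance of q(w_k), placeholder
L_w = np.linalg.cholesky(Sigma_w)           # Sigma_w = L_w L_w^T

# Push q(w_k) through u_k = Phi_Z w_k.
mu_u = Phi_Z @ mu_w
L_u = Phi_Z @ L_w                           # M x K factor; in general not triangular
Sigma_u = L_u @ L_u.T

print(np.allclose(Sigma_u, Phi_Z @ Sigma_w @ Phi_Z.T))  # True
```

Only `mu_u`, `L_u` and the inducing inputs need to be stored as the summary of task $k$; the weight-space parameters can be discarded afterwards.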

The figure below is a depiction of the proposed method.

### Selection of the Inducing Points

In practice, a simple but effective way to select inducing points is to select a random subset $Z_k$ of the training inputs $X_k$. In this paper, the authors proposed a structured way to select them. The proposed method is an unsupervised criterion that only depends on the training inputs, which quantifies how well the full kernel matrix $K_{X_k}$ can be reconstructed from the inducing inputs. This is done by minimizing the trace of the covariance matrix of the prior GP conditional $p(\boldsymbol{f}_k|\boldsymbol{u}_k)$: $\mathcal{T}(Z_k)=\text{tr}(K_{X_k} - K_{X_kZ_k}K_{Z_k}^{-1}K_{Z_kX_k}),$ where $K_{X_k} = K(X_k, X_k), K_{X_kZ_k} = K(X_k, Z_k), K_{Z_k} = K(Z_k, Z_k)$, and $K(\cdot, \cdot)$ is the kernel function parametrized by $\theta$. This method promotes finding inducing points $Z_k$ that are spread evenly in the input space. As an example, see the following figure where the final selected inducing points are spread out in different clusters of data. The round dots represent the data points and each color corresponds to a different label.
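A minimal sketch of the trace criterion is given below, together with a simple random-swap local search that starts from a random subset and accepts a swap only if $\mathcal{T}(Z_k)$ decreases. The search procedure, the random features, and the jitter are assumptions for illustration and are not necessarily the optimization scheme used by the authors.

```python
import numpy as np

def trace_criterion(Phi_X, idx, sigma_w=1.0, jitter=1e-6):
    """T(Z) = tr(K_X - K_XZ K_Z^{-1} K_ZX) with Z = X[idx] and a linear feature kernel."""
    Phi_Z = Phi_X[idx]
    K_Z = sigma_w**2 * Phi_Z @ Phi_Z.T + jitter * np.eye(len(idx))
    K_XZ = sigma_w**2 * Phi_X @ Phi_Z.T
    k_XX = sigma_w**2 * np.sum(Phi_X**2, axis=1)                   # diag of K_X
    q_XX = np.sum(np.linalg.solve(K_Z, K_XZ.T).T * K_XZ, axis=1)   # diag of K_XZ K_Z^{-1} K_ZX
    return float(np.sum(k_XX - q_XX))

def select_inducing(Phi_X, init_idx, n_iter=300, rng=None):
    """Random-swap local search: keep a swap only if the trace criterion decreases."""
    rng = np.random.default_rng() if rng is None else rng
    idx = np.array(init_idx)
    best = trace_criterion(Phi_X, idx)
    for _ in range(n_iter):
        cand = idx.copy()
        outside = np.setdiff1d(np.arange(len(Phi_X)), idx)
        cand[rng.integers(len(idx))] = rng.choice(outside)
        score = trace_criterion(Phi_X, cand)
        if score < best:
            idx, best = cand, score
    return idx, best

rng = np.random.default_rng(0)
Phi_X = rng.normal(size=(60, 10))                   # hypothetical features of X_k
random_idx = rng.choice(60, size=4, replace=False)  # random selection
random_score = trace_criterion(Phi_X, random_idx)
trace_idx, trace_score = select_inducing(Phi_X, random_idx, rng=rng)
print(random_score, trace_score)
```

By construction the optimized score is never worse than the random starting point; how much it improves depends on the data and on the number of inducing points.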

### Prediction

Given a test data point $x_{i,*}$, we can obtain the predictive density function of its output $y_{i,*}$, given by \begin{align*} p(y_{i,*}) &= \int p(y_{i,*}|f_{i,*}) p_\theta(f_{i,*}|\boldsymbol{u}_i)q(\boldsymbol{u}_i) d\boldsymbol{u}_i df_{i,*}\\ &= \int p(y_{i,*}|f_{i,*}) q_\theta(f_{i,*}) df_{i,*}, \end{align*} where $q_\theta(f_{i,*})=\mathcal{N}(f_{i,*}| \mu_{i,*}, \sigma_{i,*}^2)$ with known mean and variance \begin{align*} \mu_{i,*} &= \mu_{u_i}^TK_{Z_i}^{-1} k_{Z_ix_{i,*}}\\ \sigma_{i,*}^2 &= k(x_{i,*}, x_{i,*}) + k_{Z_ix_{i,*}}^T K_{Z_i}^{-1}[L_{u_i}L_{u_i}^T - K_{Z_i}] K_{Z_i}^{-1} k_{Z_ix_{i,*}} \end{align*} Here $k_{Z_ix_{i,*}}$ denotes the vector of kernel values between the inducing inputs $Z_i$ and $x_{i,*}$. Note that all the terms in $\mu_{i,*}$ and $\sigma_{i,*}^2$ are either already estimated or depend on some estimated parameters.
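The predictive mean and variance of $f_{i,*}$ can be computed directly from the stored task summary and the current features. The sketch below assumes the linear feature kernel from the Methods section, a small jitter on $K_{Z_i}$, and random placeholder values for the features and the summary.

```python
import numpy as np

def predict(phi_star, Phi_Z, mu_u, L_u, sigma_w=1.0, jitter=1e-6):
    """Mean and variance of q(f_*) for a test input with feature vector phi_star."""
    K_Z = sigma_w**2 * Phi_Z @ Phi_Z.T + jitter * np.eye(len(Phi_Z))
    k_Zx = sigma_w**2 * Phi_Z @ phi_star          # kernel between Z_i and x_*
    k_xx = sigma_w**2 * phi_star @ phi_star       # k(x_*, x_*)
    a = np.linalg.solve(K_Z, k_Zx)                # K_Z^{-1} k_Zx
    mean = mu_u @ a
    var = k_xx + a @ (L_u @ L_u.T - K_Z) @ a
    return mean, var

rng = np.random.default_rng(0)
M, K = 3, 5
Phi_Z = rng.normal(size=(M, K))       # hypothetical features of the inducing inputs Z_i
phi_star = rng.normal(size=K)         # hypothetical features of the test input
mu_u = rng.normal(size=M)             # stored summary of task i (placeholder values)
L_u = 0.3 * np.eye(M)
mean, var = predict(phi_star, Phi_Z, mu_u, L_u)
print(mean, var)
```

If one assumes a Gaussian likelihood with noise variance $\sigma^2$, the predictive density of $y_{i,*}$ is $\mathcal{N}(\mu_{i,*}, \sigma_{i,*}^2 + \sigma^2)$; for classification likelihoods the remaining one-dimensional integral has to be approximated.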

It is important to emphasize that the mean $\mu_{i,*}$ can be further rewritten as $\sigma_w^2\mu_{u_i}^TK_{Z_i}^{-1}\Phi_{Z_i}\phi(x_{i,*};\theta)$, since $k_{Z_ix_{i,*}} = \sigma_w^2\Phi_{Z_i}\phi(x_{i,*};\theta)$, which, notably, depends on $\theta$. This means that the expectation of $f_{i,*}$ changes over time as more tasks are learned, so the overall prediction will not be out of date. In comparison, if we use a distribution of weights $w_i$, the mean of the distribution will remain unchanged over time, thus resulting in obsolete predictions.

In the previous discussion, we have assumed the task boundaries are known, but in practice this assumption is often unrealistic. Therefore, the authors introduced a way to detect task boundaries using the GP predictive uncertainty. This is done by measuring the distance between the GP posterior density at a new data point and the prior GP density using the symmetric KL divergence. We denote this score by $\ell_i$, which can be interpreted as a degree of surprise about $x_i$ - the smaller $\ell_i$ is, the more surprising $x_i$ is. Before making any updates to the parameters, we can perform a statistical test between the values $\{\ell_i\}_{i=1}^b$ for the current batch and those from the previous batch $\{\ell_i^{old}\}_{i=1}^b$. A natural choice is Welch's t-test, which is commonly used to compare two groups of data with unequal variance.
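The detection step can be sketched with univariate Gaussians: compute the symmetric KL score for each point in a batch, then compare two batches with Welch's t statistic. The simulated posterior means and variances, the prior $\mathcal{N}(0,1)$, and the decision threshold of $-3$ are assumptions for illustration; the paper does not prescribe these values.

```python
import numpy as np

def symmetric_kl(m_q, v_q, m_p, v_p):
    """KL(q||p) + KL(p||q) for univariate Gaussians q = N(m_q, v_q), p = N(m_p, v_p)."""
    kl_qp = 0.5 * (np.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)
    kl_pq = 0.5 * (np.log(v_q / v_p) + (v_p + (m_p - m_q) ** 2) / v_q - 1.0)
    return kl_qp + kl_pq

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, dof

rng = np.random.default_rng(0)
b = 32
# Previous batch (familiar task): confident posterior, far from the prior N(0, 1).
ell_old = symmetric_kl(rng.normal(1.0, 0.2, b), np.full(b, 0.05), 0.0, 1.0)
# Current batch (unseen task): posterior has reverted towards the prior, so scores are small.
ell_new = symmetric_kl(rng.normal(0.0, 0.1, b), np.full(b, 0.9), 0.0, 1.0)

t_stat, dof = welch_t(ell_new, ell_old)
task_switch = t_stat < -3.0        # hypothetical threshold on the test statistic
print(t_stat, dof, task_switch)
```

If SciPy is available, `scipy.stats.ttest_ind(ell_new, ell_old, equal_var=False)` performs the same test and also returns a p-value.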

The figure below illustrates the intuition behind this method. With red dots indicating a new task, we can see the GP posterior (green line) reverts back to the prior (purple line) when it encounters the new task. Hence, this explains why a small $\ell_i$ corresponds to a task switch.

## Experiments

The authors aimed to answer three questions:

1. How does FRCL compare to state-of-the-art algorithms for Continual Learning?
2. How does the criterion for inducing point selection affect accuracy?
3. When ground truth task boundaries are not given, does the detection method mentioned above succeed in detecting task changes?

### Comparison to state-of-the-art algorithms

The proposed method was applied to two MNIST-variation datasets (in Table 1) and the more challenging Omniglot benchmark (in Table 2). The authors compared the method with randomly selected inducing points, denoted by FRCL(RANDOM), and the method with inducing points optimized using the trace criterion, denoted by FRCL(TRACE). The baseline method corresponds to a simple replay-buffer method described in the appendix of the paper. Both tables show that the proposed method gives strong results, setting a new state-of-the-art result on both Permuted-MNIST and Omniglot.

### Comparison of different criteria for inducing points selection

As can be seen from the figure below, the purple box corresponding to FRCL(TRACE) is consistently higher than the others, and in particular, this difference is larger when the number of inducing points is smaller. Hence, a structured selection criterion becomes more and more important as the number of inducing points decreases.

### Efficacy in detecting task boundaries

We can observe from the following figure that both the mean symmetric KL divergence and the t-test statistic always drop when a new task is introduced. Hence, the proposed method for detecting task boundaries indeed works.

## Conclusions

The proposed method unifies both the regularization-based method and the replay/rehearsal method in Continual Learning. It was able to overcome the disadvantages in both methods. Moreover, the Bayesian framework allows a probabilistic interpretation of deep neural networks. From the experiments we can make the following conclusions:

• The proposed method sets new state-of-the-art results on Permuted-MNIST and Omniglot, and is comparable to the existing results on Split-MNIST.
• A structured criterion for selecting inducing points becomes increasingly important with decreasing number of inducing points.
• The method is able to detect task boundary changes when they are not given.

Future work can include enforcing a fixed memory buffer where the summaries of all previously seen tasks are compressed into one summary. Also, it would be interesting to investigate the application of the proposed method to other domains such as reinforcement learning.

## Critiques

This paper presents a new way of remembering previous tasks by reducing the KL divergence between the variational distribution $q(\boldsymbol{u}_1)$ and the prior $p_\theta(\boldsymbol{u}_1)$. The ideas in the paper are interesting and the experiments support the effectiveness of this approach. After reading the summary, some points came to my mind as follows:

The main problem with Gaussian processes is the test-time computational load: a Gaussian process needs a data matrix and a kernel for each prediction. This is natural, since a Gaussian process is non-parametric and has no source of knowledge other than the data, but it comes with computational and memory costs that make it difficult to employ in practice. To mitigate these challenges, the authors propose using a subset of the training data, namely the "Inducing Points". They propose to choose the inducing points either at random or based on an optimization scheme in which the inducing points should approximate the kernel. Although the authors state the inducing point problem in their own formulation, this is not a new approach in the field; it was previously known as the Finding Exemplars problem. In fact, their formulation is very similar to the ideas in the following paper:

Elhamifar, Ehsan, Guillermo Sapiro, and Rene Vidal. Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery. Advances in Neural Information Processing Systems. 2012.

More precisely, the main difference is that the current paper uses the kernel matrix, whereas the mentioned paper uses dissimilarities, to find the exemplars or inducing points.

Moreover, one unanswered question is how to determine the number of exemplars, as they play an important role in this algorithm.

Besides, one practical point is replacing the covariance matrix with its Cholesky decomposition. In practice, covariance matrices are in general only positive semi-definite, while to the best of my knowledge the Cholesky decomposition applies to positive definite matrices. Considering this, I am not sure what happens if the Cholesky decomposition is explicitly applied to the covariance matrix.

Finally, the number of regularization terms in $\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))$ grows linearly with the number of tasks, and I am not sure how this algorithm works when the number of tasks increases. Clearly, apart from the computational cost, having many regularization terms can make optimization more difficult.

## References

[1] Rasmussen, Carl Edward and Williams, Christopher K. I., 2006, Gaussian Processes for Machine Learning, The MIT Press.

[2] Quinonero-Candela, Joaquin and Rasmussen, Carl Edward, 2005, A Unifying View of Sparse Approximate Gaussian Process Regression, Journal of Machine Learning Research, Volume 6, P1939-1959.

[3] Snelson, Edward and Ghahramani, Zoubin, 2006, Sparse Gaussian Processes using Pseudo-inputs, Advances in Neural Information Processing Systems 18, P1257-1264.

[4] Titsias, Michalis K., 2009, Variational Learning of Inducing Variables in Sparse Gaussian Processes, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Volume 5, P567-574.

[5] Titsias, Michalis K., Schwarz, Jonathan, Matthews, Alexander G. de G., Pascanu, Razvan, and Teh, Yee Whye, 2020, Functional Regularisation for Continual Learning with Gaussian Processes, ArXiv abs/1901.11356.