Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks


Introduction & Background

Learning quickly is a hallmark of human intelligence, whether it involves recognizing objects from a few examples or quickly learning new skills after just minutes of experience. In this work, we propose a meta-learning algorithm that is general and model-agnostic, in the sense that it can be directly applied to any learning problem and model that is trained with a gradient descent procedure. Our focus is on deep neural network models, but we illustrate how our approach can easily handle different architectures and different problem settings, including classification, regression, and policy gradient reinforcement learning, with minimal modification. Unlike prior meta-learning methods that learn an update function or learning rule (Schmidhuber, 1987; Bengio et al., 1992; Andrychowicz et al., 2016; Ravi & Larochelle, 2017), this algorithm does not expand the number of learned parameters nor place constraints on the model architecture (e.g. by requiring a recurrent model (Santoro et al., 2016) or a Siamese network (Koch, 2015)), and it can be readily combined with fully connected, convolutional, or recurrent neural networks. It can also be used with a variety of loss functions, including differentiable supervised losses and nondifferentiable reinforcement learning objectives.

The primary contribution of this work is a simple model and task-agnostic algorithm for meta-learning that trains a model’s parameters such that a small number of gradient updates will lead to fast learning on a new task. The paper shows the effectiveness of the proposed algorithm in different domains, including classification, regression, and reinforcement learning problems.

Model-Agnostic Meta Learning (MAML)

The goal of the proposed method is rapid adaptation, a setting usually formalized as few-shot learning.

Problem set-up

The goal of few-shot meta-learning is to train a model that can quickly adapt to a new task using only a few datapoints and training iterations. To do so, the model is trained during a meta-learning phase on a set of tasks, such that it can then be adapted to a new task using only a small number of parameter updates. In effect, the meta-learning problem treats entire tasks as training examples.

Let us consider a model, denoted by $f$, that maps observations $\mathbf{x}$ to outputs $a$. During meta-learning, the model is trained to be able to adapt to a large or infinite number of tasks.

Let us consider a generic notion of a task. Each task $\mathcal{T} = \{\mathcal{L}(\mathbf{x}_1, a_1, \mathbf{x}_2, a_2, \ldots, \mathbf{x}_H, a_H), q(\mathbf{x}_1), q(\mathbf{x}_{t+1}|\mathbf{x}_t, a_t), H \}$ consists of a loss function $\mathcal{L}$, a distribution over initial observations $q(\mathbf{x}_1)$, a transition distribution $q(\mathbf{x}_{t+1}|\mathbf{x}_t, a_t)$, and an episode length $H$. In i.i.d. supervised learning problems, the length is $H = 1$. The model may generate samples of length $H$ by choosing an output $a_t$ at each time $t$. The loss $\mathcal{L}$ provides task-specific feedback, which is defined based on the nature of the problem (e.g. a misclassification loss or a cost function in a Markov decision process).

A distribution over tasks is denoted by $p(\mathcal{T})$. In the K-shot learning setting, the model is trained to learn a new task $\mathcal{T}_i$ drawn from $p(\mathcal{T})$ from only K samples drawn from $q_i$ and feedback $\mathcal{L}_{\mathcal{T}_i}$ generated by $\mathcal{T}_i$. During meta-training, a task $\mathcal{T}_i$ is sampled from $p(\mathcal{T})$, the model is trained with K samples and feedback from the corresponding loss $\mathcal{L}_{\mathcal{T}_i}$, and then tested on new samples from $\mathcal{T}_i$. The model $f$ is then improved by considering how the test error on new data from $q_i$ changes with respect to the parameters. In effect, the test error on sampled tasks $\mathcal{T}_i$ serves as the training error of the meta-learning process. At the end of meta-training, new tasks are sampled from $p(\mathcal{T})$, and meta-performance is measured by the model’s performance after learning from K samples.
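To make the K-shot setup concrete, the following is a minimal sketch of sampling a task $\mathcal{T}_i$ and its K support samples plus held-out test samples, using a sinusoid-regression task distribution as in the paper's regression experiments. The helper names and the amplitude/phase ranges are illustrative assumptions, not definitions from the paper.

<pre>
import numpy as np

# Minimal sketch of K-shot task sampling from a sinusoid-regression task
# distribution p(T); names and parameter ranges are illustrative.
def sample_task(rng):
    """Draw one task T_i ~ p(T): a sinusoid with random amplitude and phase."""
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, np.pi)
    return lambda x: amplitude * np.sin(x + phase)

def sample_datapoints(task_fn, k, rng):
    """Draw k observations from q_i and their task-specific targets."""
    x = rng.uniform(-5.0, 5.0, size=(k, 1))
    return x, task_fn(x)

rng = np.random.default_rng(0)
task = sample_task(rng)                                  # T_i ~ p(T)
x_support, y_support = sample_datapoints(task, 5, rng)   # K samples used for adaptation
x_query, y_query = sample_datapoints(task, 10, rng)      # new samples whose error drives meta-training
</pre>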


MAML Algorithm

Figure 1: Diagram of the MAML algorithm

The paper proposes a method that can learn the parameters of any standard model via meta-learning in such a way as to prepare that model for fast adaptation. The intuition behind this approach is that some internal representations are more transferable than others. Since the model will be fine-tuned using a gradient-based learning rule on a new task, the aim is to learn the model in such a way that this gradient-based learning rule can make rapid progress on new tasks drawn from $p(\mathcal{T})$, without overfitting. In effect, the aim is to find model parameters that are sensitive to changes in the task, such that small changes in the parameters produce large improvements on the loss function of any task drawn from $p(\mathcal{T})$ (see Figure 1).

Note that there is no assumption about the form of the model. The only assumptions are that it is parameterized by a vector of parameters $\theta$ and that the loss is smooth enough in $\theta$ that the parameters can be learned using gradient-based techniques. Formally, let the model be denoted by $f_{\theta}$. When adapting to a new task $\mathcal{T}_i$, the model’s parameters $\theta$ become $\theta_i'$. In this method, the updated parameter vector $\theta_i'$ is computed using one or more gradient descent updates on task $\mathcal{T}_i$. For example, when using one gradient update:

$$ \theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta}). $$

Here $\alpha$ is the step size of the task-level update and is treated as a hyperparameter. For the sake of simplicity, a single gradient update is considered in the rest of the paper, but using multiple gradient updates is a straightforward extension.
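As a concrete illustration, below is a minimal PyTorch sketch of this one-step adaptation. The helper model_loss, which evaluates $\mathcal{L}_{\mathcal{T}_i}$ for a given list of parameters, is an assumed name and not part of the paper.

<pre>
import torch

def inner_update(params, x_support, y_support, model_loss, alpha=0.01):
    """One-step adaptation: theta_i' = theta - alpha * grad_theta L_Ti(f_theta).

    `params` is a list of tensors with requires_grad=True; `model_loss` is an
    assumed helper computing the task loss of the model under `params`.
    """
    loss = model_loss(params, x_support, y_support)
    # create_graph=True keeps this step differentiable, so the meta-gradient
    # can later be taken with respect to the original parameters theta.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return [p - alpha * g for p, g in zip(params, grads)]
</pre>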

The model parameters are trained by optimizing for the performance of $f_{\theta_i'}$ with respect to $\theta$ across tasks sampled from $p(\mathcal{T})$. More concretely, the meta-objective is as follows:

$$ \min_{\theta} \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'}) = \min_{\theta} \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\left(f_{\theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta})}\right). $$
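Note that the meta-optimization is performed over the model parameters $\theta$, whereas the objective is computed using the updated parameters $\theta_i'$. The meta-objective is optimized with stochastic gradient descent, i.e. $\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'})$, where $\beta$ is the meta step size. The sketch below puts the pieces together for one meta-update; it reuses the inner_update sketch from above and assumes task-sampling and loss helpers analogous to the earlier sketches (all names are illustrative, not from the paper).

<pre>
import torch

def maml_meta_step(params, sample_tasks, sample_datapoints, model_loss,
                   meta_batch_size=4, k=5, alpha=0.01, beta=0.001):
    """One outer-loop update of MAML (a sketch under the assumptions above)."""
    meta_loss = 0.0
    for task in sample_tasks(meta_batch_size):                  # T_i ~ p(T)
        x_s, y_s = sample_datapoints(task, k)                    # K support samples
        x_q, y_q = sample_datapoints(task, k)                    # held-out query samples
        adapted = inner_update(params, x_s, y_s, model_loss, alpha)
        meta_loss = meta_loss + model_loss(adapted, x_q, y_q)    # L_Ti(f_theta_i')
    meta_grads = torch.autograd.grad(meta_loss, params)
    with torch.no_grad():                                        # theta <- theta - beta * grad
        for p, g in zip(params, meta_grads):
            p -= beta * g
    return params
</pre>

Differentiating through the inner update introduces second-order derivatives; the paper also evaluates a first-order approximation that drops these terms and finds it performs nearly as well while being cheaper to compute.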

Different Types of MAML

Experiments

Conclusion & Critique

References

  1. Schmidhuber, Jürgen. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 1992.