='''Introduction & Background'''=
Learning quickly is a hallmark of human intelligence, whether it involves recognizing objects from a few examples or quickly learning new skills after just minutes of experience. Meta-learning is a subfield of machine learning where automatic learning algorithms are applied to meta-data about machine learning experiments. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In this work, the authors propose a meta-learning algorithm that is general and model-agnostic, in the sense that it can be directly applied to any learning problem and model that is trained with a gradient descent procedure. The focus is on deep neural network models, but the approach can easily handle different architectures and different problem settings, including classification, regression, and policy gradient reinforcement learning, with minimal modification. Unlike prior meta-learning methods that learn an update function or learning rule [1,2,3,4], this algorithm does not expand the number of learned parameters nor place constraints on the model architecture (e.g. by requiring a recurrent model [5] or a Siamese network [6]), and it can be readily combined with fully connected, convolutional, or recurrent neural networks. It can also be used with a variety of loss functions, including differentiable supervised losses and non-differentiable reinforcement learning objectives.


The primary contribution of this work is a simple model- and task-agnostic algorithm for meta-learning that trains a model's parameters such that a small number of gradient updates leads to fast learning on a new task. The paper shows the effectiveness of the proposed algorithm in different domains, including classification, regression, and reinforcement learning problems.


==Key Idea==
Arguably, the biggest success story of transfer learning has been initializing vision network weights using pre-trained ImageNet models. In particular, when approaching any new vision task, the well-known paradigm is to first collect labeled data for the task, acquire a network pre-trained on ImageNet classification, and then fine-tune the network on the collected data using gradient descent. Using this approach, neural networks can learn new image-based tasks more effectively from modestly-sized datasets. However, pre-training only goes so far: the last layers of the network still need to be heavily adapted to the new task, and datasets that are too small, as in the few-shot setting, still cause severe overfitting. Furthermore, there is unfortunately no analogous pre-training scheme for non-vision domains such as speech, language, and control.


What if we directly optimized for an initial representation that can be effectively fine-tuned from a small number of examples? This is exactly the idea behind this paper, model-agnostic meta-learning (MAML). Like other meta-learning methods, MAML trains over a wide range of tasks. It trains for a representation that can be quickly adapted to a new task, via a few gradient steps. The meta-learner seeks to find an initialization that is not only useful for adapting to various problems but also can be adapted quickly (in a small number of steps) and efficiently (using only a few examples).


The key idea underlying this method is to train the model’s initial parameters such that the model has maximal performance on a new task after the parameters have been updated through one or more gradient steps computed with a small amount of data from that new task. This can be viewed from a feature learning standpoint as building an internal representation that is broadly suitable for many tasks. If the internal representation is suitable for many tasks, simply fine-tuning the parameters slightly (e.g. by primarily modifying the top layer weights in a feedforward model) can produce good results.


='''Model-Agnostic Meta Learning (MAML)'''=
The goal of the proposed method is rapid adaptation: learning a new function for a task from only a few input/output pairs, using prior experience from similar tasks during meta-learning. This setting is usually formalized as few-shot learning.


=== Problem set-up ===
The goal of few-shot meta-learning is to train a model that can quickly adapt to a new task using only a few data points and training iterations. To do so, the model is trained during a meta-learning phase on a set of tasks, such that it can then be adapted to a new task using only a small number of parameter updates. In effect, the meta-learning problem treats entire tasks as training examples.  


Let us consider a model, denoted by $f$, that maps observations $\mathbf{x}$ to output variables $a$. During meta-learning, the model is trained to be able to adapt to a large or infinite number of tasks.


Let us consider a generic notion of a task, as below. Each task $\mathcal{T} = \{\mathcal{L}(\mathbf{x}_1,a_1,\mathbf{x}_2,a_2,..., \mathbf{x}_H,a_H), q(\mathbf{x}_1),q(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t),H \}$ consists of a loss function $\mathcal{L}$, a distribution over initial observations $q(\mathbf{x}_1)$, a transition distribution $q(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t)$, and an episode length $H$. In i.i.d. supervised learning problems, the length is $H = 1$. The model may generate samples of length $H$ by choosing an output $a_t$ at each time $t$. The loss $\mathcal{L}$ provides task-specific feedback, which is defined based on the nature of the problem.
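
To make the task abstraction concrete, the tuple above can be mirrored directly as a small data structure. This is only an illustrative sketch; the field names and the exact callable signatures are assumptions for this summary, not details from the paper.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """Mirror of the task tuple T = {L, q(x_1), q(x_{t+1} | x_t, a_t), H}."""
    loss: Callable          # L(x_1, a_1, ..., x_H, a_H) -> scalar task feedback
    initial_dist: Callable  # () -> x_1, a sample from q(x_1)
    transition: Callable    # (x_t, a_t) -> x_{t+1}, a sample from q(x_{t+1} | x_t, a_t)
    horizon: int = 1        # H; H = 1 for i.i.d. supervised learning problems
</syntaxhighlight>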


A distribution over tasks is denoted by $p(\mathcal{T})$. In the K-shot learning setting, the model is trained to learn a new task $\mathcal{T}_i$ drawn from $p(\mathcal{T})$ from only K samples drawn from $q_i$ and feedback $\mathcal{L}_{\mathcal{T}_i}$ generated by $\mathcal{T}_i$. During meta-training, a task $\mathcal{T}_i$ is sampled from $p(\mathcal{T})$, the model is trained with K samples and feedback from the corresponding loss $\mathcal{L}_{\mathcal{T}_i}$, and it is then tested on new samples from $\mathcal{T}_i$. The model $f$ is then improved by considering how the test error on new data from $q_i$ changes with respect to the parameters. In effect, the test error on sampled tasks $\mathcal{T}_i$ serves as the training error of the meta-learning process. At the end of meta-training, new tasks are sampled from $p(\mathcal{T})$, and meta-performance is measured by the model's performance after learning from K samples. Note that tasks used for meta-testing are held out during meta-training.


=== MAML Algorithm ===
[[File:model.png|200px|right|thumb|Figure 1: Diagram of the MAML algorithm]]
The paper proposes a method that can learn the parameters of any standard model via meta-learning in such a way as to prepare that model for fast adaptation. The intuition behind this approach is that some internal representations are more transferable than others. Since the model will be fine-tuned using a gradient-based learning rule on a new task, the aim is to learn a model such that this gradient-based learning rule can make rapid progress on new tasks drawn from $p(\mathcal{T})$, without overfitting. In effect, the aim is to find model parameters that are sensitive to changes in the task, such that small changes in the parameters produce large improvements on the loss function of any task drawn from $p(\mathcal{T})$. Fig. 1 is a visualization of the MAML algorithm: suppose we are seeking a set of parameters $\theta$ that are highly adaptable. During the course of meta-learning (the bold line), MAML optimizes for a set of parameters such that when a gradient step is taken with respect to a particular task $i$ (the gray lines), the parameters are close to the optimal parameters $\theta^*_i$ for task $i$.


Note that there is no assumption about the form of the model. The only assumption is that it is parameterized by a vector of parameters $\theta$, and that the loss is smooth enough in $\theta$ that the parameters can be learned using gradient-based techniques. Formally, let us assume that the model is denoted by $f_{\theta}$. When adapting to a new task $\mathcal{T}_i$, the model's parameters $\theta$ become $\theta_i'$. In this method, the updated parameter vector $\theta_i'$ is computed using one or more gradient descent updates on task $\mathcal{T}_i$. For example, when using one gradient update:


$$
\theta_i' = \theta - \alpha \nabla_{\theta}\mathcal{L}_{\mathcal{T}_i}(f_{\theta}).
$$


Here $\alpha$ is the learning rate (or step size) for each task and is treated as a hyperparameter. For simplicity, the rest of the paper considers a single gradient update per task.
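
As a minimal sketch of this inner update (not the authors' reference implementation), the adapted parameters $\theta_i'$ can be computed with one explicit gradient step. The helper assumes the model's parameters are given as an explicit list of tensors and the task loss has already been computed from them; the default step size is an arbitrary illustrative choice.

<syntaxhighlight lang="python">
import torch

def inner_update(params, task_loss, alpha=0.01):
    """One gradient step on a task loss: theta_i' = theta - alpha * grad L_Ti(f_theta).

    create_graph=True retains the graph of this update so that gradients of the
    meta-objective below can later flow back through it to the original theta.
    """
    grads = torch.autograd.grad(task_loss, params, create_graph=True)
    return [p - alpha * g for p, g in zip(params, grads)]
</syntaxhighlight>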


The model parameters are trained by optimizing for the performance of $f_{\theta_i'}$ with respect to $\theta$ across tasks sampled from $p(\mathcal{T})$. More concretely, the meta-objective is as follows:


$$
\min_{\theta} \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'})  = \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta - \alpha \nabla_{\theta}\mathcal{L}_{\mathcal{T}_i}(f_{\theta})})
$$


Note that the meta-optimization is performed over the model parameters $\theta$, whereas the objective is computed using the updated model parameters $\theta_i'$. In effect, the method aims to optimize the model parameters such that one or a small number of gradient steps on a new task will produce maximally effective behavior on that task.


Meta-learning across tasks is therefore performed via stochastic gradient descent (SGD), such that the model parameters $\theta$ are updated as follows:


$$
\theta \gets \theta - \beta \nabla_{\theta } \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'})
$$
where $\beta$ is the meta step size. An outline of the algorithm is shown in Algorithm 1.
[[File:ershad_alg1.png|500px|center|thumb]]
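
The sketch below puts the pieces together in the spirit of Algorithm 1, using a tiny fully connected regressor written functionally so that adapted parameters can be swapped in. The task-batch format (support and query points per task), the network sizes, and the step sizes are hypothetical illustrative choices, not the reference implementation.

<syntaxhighlight lang="python">
import torch

def forward(params, x):
    """Small two-layer ReLU network applied with an explicit parameter list."""
    w1, b1, w2, b2 = params
    return torch.relu(x @ w1 + b1) @ w2 + b2

def maml_meta_step(params, task_batch, alpha=0.01, beta=0.001):
    """One meta-update: inner adaptation per task, then SGD on the summed query loss.

    task_batch is a list of (x_support, y_support, x_query, y_query) tuples,
    e.g. produced by a hypothetical task-sampling helper.
    """
    meta_loss = 0.0
    for x_s, y_s, x_q, y_q in task_batch:
        # Inner loop: theta_i' = theta - alpha * grad_theta L_Ti(f_theta) on the K support points.
        support_loss = ((forward(params, x_s) - y_s) ** 2).mean()
        grads = torch.autograd.grad(support_loss, params, create_graph=True)
        adapted = [p - alpha * g for p, g in zip(params, grads)]
        # Outer objective: loss of the adapted parameters on held-out query points.
        meta_loss = meta_loss + ((forward(adapted, x_q) - y_q) ** 2).mean()
    # Meta-update: theta <- theta - beta * grad_theta sum_i L_Ti(f_{theta_i'}).
    meta_grads = torch.autograd.grad(meta_loss, params)
    return [(p - beta * g).detach().requires_grad_(True) for p, g in zip(params, meta_grads)]

# Initial parameters theta (1-D input/output, 40 hidden units; arbitrary sizes).
params = [0.1 * torch.randn(1, 40), torch.zeros(40), 0.1 * torch.randn(40, 1), torch.zeros(1)]
params = [p.requires_grad_(True) for p in params]
</syntaxhighlight>

Iterating <code>maml_meta_step</code> over batches of sampled tasks corresponds to the outer loop of Algorithm 1; detaching the returned parameters simply keeps each meta-iteration's computation graph from growing across iterations.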


The MAML meta-gradient update involves a gradient through a gradient. Computationally, this requires an additional backward pass through f to compute Hessian-vector products, which is supported by standard deep learning libraries such as TensorFlow.
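
The following toy example (an illustration for this summary, not from the paper) shows the second-order term concretely: differentiating a loss evaluated at post-update parameters propagates through the inner gradient itself, which is the extra backward pass mentioned above.

<syntaxhighlight lang="python">
import torch

theta = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (theta ** 2).sum()                                    # L(theta)
g = torch.autograd.grad(loss, theta, create_graph=True)[0]   # dL/dtheta = 2 * theta
adapted = theta - 0.01 * g                                   # one inner step: theta' = 0.98 * theta
outer = (adapted ** 2).sum()                                 # loss evaluated at adapted parameters
meta_grad = torch.autograd.grad(outer, theta)[0]             # backprop through the inner step
print(meta_grad)  # tensor([1.9208, 3.8416]) = 2 * 0.98**2 * theta, including the Hessian term
</syntaxhighlight>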


='''Different Types of MAML'''=
In this section, the MAML algorithm is discussed for different supervised learning and reinforcement learning tasks. The differences between these tasks lie in their loss functions and in the way data is generated. In general, the method requires no additional model parameters and no separate meta-learner to learn the parameter update. Compared to approaches that learn to compare new examples in a learned metric space (e.g. Siamese networks or recurrence with attention mechanisms), the proposed method generalizes readily to other problem classes, including classification, regression, and reinforcement learning.


=== Supervised Regression and Classification ===
Few-shot learning is well studied in the supervised setting. For these two types of tasks, the horizon is $H = 1$, since the model accepts a single input and produces a single output rather than a sequence of inputs and outputs. The task $\mathcal{T}_i$ generates $K$ i.i.d. observations $\mathbf{x}$ from $q_i$, and the task loss is the error between the model's output for $\mathbf{x}$ and the corresponding target values $\mathbf{y}$ for that observation and task.


Although any common classification and regression objectives can be used as the loss, the paper uses the following losses for these two tasks.  


Regression: For regression, the mean-squared error (MSE) is used:


$$
\mathcal{L}_{\mathcal{T}_i} (f_{\theta})  =  \sum \limits_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \parallel  f_{\theta} (\mathbf{x}^{(j)}) - \mathbf{y}^{(j)}\parallel_2^2
$$
 
where $\mathbf{x}^{(j)}$ and $\mathbf{y}^{(j)}$ denote an input/output pair sampled from task $\mathcal{T}_i$. In K-shot regression, K such pairs are provided for learning each task.
 
Classification: For classification, the cross-entropy loss is used:
 
$$
\mathcal{L}_{\mathcal{T}_i} (f_{\theta})  =  - \sum \limits_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \Big[ \mathbf{y}^{(j)} \log f_{\theta}(\mathbf{x}^{(j)}) + (1-\mathbf{y}^{(j)}) \log (1-f_{\theta}(\mathbf{x}^{(j)})) \Big]
$$
 
According to the conventional terminology, K-shot classification tasks use K input/output pairs from each class, for a total of $NK$ data points for N-way classification.
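
For concreteness, constructing one N-way, K-shot classification episode can be sketched as below. The <code>class_to_images</code> mapping and the query-set size are illustrative assumptions for this summary rather than details from the paper.

<syntaxhighlight lang="python">
import numpy as np

def sample_n_way_k_shot(class_to_images, n_way=5, k_shot=1, k_query=15, rng=None):
    """Build one few-shot classification task from a labeled pool.

    class_to_images is a hypothetical dict mapping class id -> array of images;
    labels are re-indexed 0..n_way-1 within the task, as in episodic training.
    """
    rng = rng or np.random.default_rng()
    classes = rng.choice(list(class_to_images), size=n_way, replace=False)
    support, query = [], []
    for new_label, c in enumerate(classes):
        imgs = class_to_images[c]
        idx = rng.choice(len(imgs), size=k_shot + k_query, replace=False)
        support += [(imgs[i], new_label) for i in idx[:k_shot]]
        query += [(imgs[i], new_label) for i in idx[k_shot:]]
    return support, query
</syntaxhighlight>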
 
Given a distribution over tasks, these loss functions can be directly inserted into the equations in the previous section to perform meta-learning, as detailed in Algorithm 2.
[[File:ershad_alg2.png|500px|center|thumb]]
 
=== Reinforcement Learning ===
In reinforcement learning (RL), the goal of few-shot meta learning is to enable an agent to quickly acquire a policy for a new test task using only a small amount of experience in the test setting. A new task might involve achieving a new goal or succeeding on a previously trained goal in a new environment. For example, an agent may learn how to navigate mazes very quickly so that, when faced with a new maze, it can determine how to reliably reach the exit with only a few samples.
 
Each RL task $\mathcal{T}_i$ contains an initial state distribution $q_i(\mathbf{x}_1)$ and a transition distribution $q_i(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t)$ , and the loss $\mathcal{L}_{\mathcal{T}_i}$ corresponds to the (negative) reward function $R$. The entire task is therefore a Markov decision process (MDP) with horizon H, where the learner is allowed to query a limited number of sample trajectories for few-shot learning. Any aspect of the MDP may change across tasks in $p(\mathcal{T})$. The model being learned, $f_{\theta}$, is a policy that maps from states $\mathbf{x}_t$ to a distribution over actions $a_t$ at each timestep $t \in \{1,...,H\}$. The loss for task $\mathcal{T}_i$ and model $f_{\theta}$ takes the form
 
$$
\mathcal{L}_{\mathcal{T}_i}(f_{\theta}) = -\mathbb{E}_{\mathbf{x}_t,a_t \sim f_{\theta},q_{\mathcal{T}_i}} \big [\sum_{t=1}^H  R_i(\mathbf{x}_t,a_t)\big ]
$$
 
 
In K-shot reinforcement learning, K rollouts from $f_{\theta}$ and task $\mathcal{T}_i$, $(\mathbf{x}_1,a_1,...,\mathbf{x}_H)$, and the corresponding rewards $ R(\mathbf{x}_t,a_t)$, may be used for adaptation on a new task $\mathcal{T}_i$.
 
Since the expected reward is generally not differentiable due to unknown dynamics, policy gradient methods are used to estimate the gradient both for the model's gradient update(s) and for the meta-optimization. Since policy gradient is an on-policy algorithm, each additional gradient step during the adaptation of $f_{\theta}$ requires new samples from the current policy $f_{\theta_i'}$. The method is detailed in Algorithm 3, which has the same structure as Algorithm 2 but additionally requires sampling trajectories from the environment corresponding to task $\mathcal{T}_i$ in steps 5 and 8. A variety of improvements to the policy gradient estimator, including state- or action-dependent baselines, may also be used here.
[[File:ershad_alg3.png|500px|center|thumb]]
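
A rough sketch of the policy-gradient inner adaptation is given below. The REINFORCE-style surrogate with reward-to-go and the <code>policy(states)</code> interface (returning a torch distribution) are assumptions made for illustration; the paper's setup also allows baselines, and the TRPO meta-update is not shown.

<syntaxhighlight lang="python">
import torch

def policy_gradient_loss(policy, trajectories):
    """Surrogate loss whose gradient estimates the gradient of -E[sum_t R(x_t, a_t)].

    Each trajectory is a (states, actions, rewards) tuple of tensors; weighting by the
    reward-to-go is a standard variance-reduced REINFORCE estimator (illustrative choice).
    """
    loss = 0.0
    for states, actions, rewards in trajectories:
        returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])  # sum_{t' >= t} r_{t'}
        log_probs = policy(states).log_prob(actions)   # policy(states) -> torch.distributions object
        loss = loss - (log_probs * returns).sum()
    return loss / len(trajectories)

def adapt(policy_params, loss, alpha=0.1):
    """Inner step of the RL variant: theta_i' = theta - alpha * grad L_Ti(f_theta)."""
    grads = torch.autograd.grad(loss, policy_params, create_graph=True)
    return [p - alpha * g for p, g in zip(policy_params, grads)]
</syntaxhighlight>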


='''Experiments'''=


=== Regression ===
We start with a simple regression problem that illustrates the basic principles of MAML. Each task involves regressing from the input to the output of a sine wave, where the amplitude and phase of the sinusoid vary between tasks. Thus, $p(\mathcal{T})$ is continuous, and the input and output both have dimensionality 1. During training and testing, datapoints are sampled uniformly. The loss is the mean-squared error between the prediction and the true value. The regressor is a neural network with 2 hidden layers of size 40 and ReLU nonlinearities. When training with MAML, one gradient update is used with K = 10 examples and a fixed step size of 0.01, with Adam as the meta-optimizer [7]. The baselines are likewise trained with Adam. To evaluate performance, a single meta-learned model is fine-tuned on varying numbers of K examples and compared to two baselines: (a) pre-training on all of the tasks, which entails training a network to regress to random sinusoid functions and then, at test time, fine-tuning with gradient descent on the K provided points using an automatically tuned step size, and (b) an oracle which receives the true amplitude and phase as input.
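
The regression setup can be sketched as follows. The exact sampling ranges for amplitude, phase, and inputs are assumptions (the text only says they vary and that points are sampled uniformly), while the network sizes, K, and the use of Adam come from the description above.

<syntaxhighlight lang="python">
import numpy as np
import torch
import torch.nn as nn

def sample_sine_task(rng, k=10):
    """One regression task: a sinusoid with random amplitude and phase.

    The ranges for amplitude, phase, and the input interval are illustrative
    assumptions, not values stated in this summary.
    """
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, np.pi)
    x = rng.uniform(-5.0, 5.0, size=(k, 1)).astype(np.float32)
    y = (amplitude * np.sin(x + phase)).astype(np.float32)
    return torch.from_numpy(x), torch.from_numpy(y)

# Regressor described in the text: two hidden layers of 40 units with ReLU.
regressor = nn.Sequential(
    nn.Linear(1, 40), nn.ReLU(),
    nn.Linear(40, 40), nn.ReLU(),
    nn.Linear(40, 1),
)
meta_optimizer = torch.optim.Adam(regressor.parameters())  # Adam as the meta-optimizer
</syntaxhighlight>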


We evaluate performance by fine-tuning the model learned by MAML and the pre-trained model on $K = \{ 5,10,20 \}$ datapoints. During fine-tuning, each gradient step is computed using the same $K$ datapoints. Results are shown in Fig 2.




[[File:ershad_results1.png|500px|center|thumb|Figure 2: Few-shot adaptation for the simple regression task. Left: Note that MAML is able to estimate parts of the curve where there are no datapoints, indicating that the model has learned about the periodic structure of sine waves. Right: Fine-tuning of a model pre-trained on the same distribution of tasks without MAML, with a tuned step size. Due to the often contradictory outputs on the pre-training tasks, this model is unable to recover a suitable representation and fails to extrapolate from the small number of test-time samples.]]


=== Classification ===


For classification evaluation, Omniglot and MiniImagenet datasets are used. The Omniglot dataset consists of 20 instances of 1623 characters from 50 different alphabets.  


The experiment involves fast learning of N-way classification with 1 or 5 shots. The problem of N-way classification is set up as follows: select N unseen classes, provide the model with K different instances of each of the N classes, and evaluate the model's ability to classify new instances from those N classes. For Omniglot, 1200 characters are selected randomly for training, irrespective of alphabet, and the remaining characters are used for testing. The Omniglot dataset is augmented with rotations by multiples of 90 degrees.


The model follows the same architecture as the embedding function used by the matching networks of [11]: 4 modules, each consisting of a 3-by-3 convolution with 64 filters, followed by batch normalization, a ReLU nonlinearity, and 2-by-2 max-pooling. The Omniglot images are downsampled to 28-by-28, so the dimensionality of the last hidden layer is 64. The last layer is fed into a softmax. For Omniglot, strided convolutions are used instead of max-pooling. For MiniImagenet, 32 filters per layer are used to reduce overfitting. In order to provide a fair comparison against memory-augmented neural networks [5] and to test the flexibility of MAML, results for a non-convolutional network are also provided.
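
A sketch of the convolutional classifier described above (the max-pooling variant; per the text, the Omniglot runs replace pooling with strided convolutions, and MiniImagenet uses 32 filters). The padding choice and the final linear layer before the softmax are implementation assumptions.

<syntaxhighlight lang="python">
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One of the four modules: 3x3 convolution, batch norm, ReLU, 2x2 max-pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

def make_classifier(n_way=5, channels=64, in_ch=1):
    """For 28x28 Omniglot inputs the spatial size shrinks 28 -> 14 -> 7 -> 3 -> 1,
    so the last hidden layer has 64 units, matching the description above."""
    return nn.Sequential(
        conv_block(in_ch, channels),
        conv_block(channels, channels),
        conv_block(channels, channels),
        conv_block(channels, channels),
        nn.Flatten(),
        nn.Linear(channels, n_way),  # softmax is applied by the cross-entropy loss
    )
</syntaxhighlight>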


[[File:ershad_results2.png|500px|center|thumb|Table 1: Few-shot classification on held-out Omniglot characters (top) and the MiniImagenet test set (bottom). MAML achieves results that are comparable to or outperform state-of-the-art convolutional and recurrent models. Siamese nets, matching nets, and the memory module approaches are all specific to classification and are not directly applicable to regression or RL scenarios. The $\pm$ shows 95% confidence intervals over tasks. ]]


=== Reinforcement Learning ===
Several simulated continuous control environments are used for the RL evaluation. In all of the domains, the MAML model is a neural network policy with two hidden layers of size 100 and ReLU activations. The gradient updates during adaptation are computed using vanilla policy gradient, and trust-region policy optimization (TRPO) is used as the meta-optimizer.


In order to avoid computing third derivatives, finite differences are used to compute the Hessian-vector products required by TRPO. For both the learning and meta-learning updates, the standard linear feature baseline proposed in [8] is used, fitted separately at each iteration for each sampled task in the batch.
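
As a rough sketch only: a linear baseline of this kind can be fit per task by least squares from simple state features to empirical returns. The particular feature map below (observations plus a bias term) is an assumption for illustration, not necessarily the construction used in [8].

<syntaxhighlight lang="python">
import numpy as np

def fit_linear_baseline(observations, returns, reg=1e-5):
    """Least-squares fit of a linear value baseline on simple state features.

    observations: array of shape (T, obs_dim); returns: array of shape (T,).
    The feature map here is an illustrative assumption.
    """
    feats = np.concatenate([observations, np.ones((len(observations), 1))], axis=1)
    # Ridge-regularized normal equations: (F^T F + reg * I) w = F^T returns
    a = feats.T @ feats + reg * np.eye(feats.shape[1])
    w = np.linalg.solve(a, feats.T @ returns)
    return lambda obs: np.concatenate([obs, np.ones((len(obs), 1))], axis=1) @ w
</syntaxhighlight>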


Three baseline models are used for comparison:
* (a) pretraining one policy on all of the tasks and then fine-tuning,
* (b) training a policy from randomly initialized weights, and
* (c) an oracle policy which receives the parameters of the task as input, which for the tasks below corresponds to a goal position, goal direction, or goal velocity for the agent.


Baseline models (a) and (b) are fine-tuned with gradient descent using a manually tuned step size.


2D Navigation: In the first meta-RL experiment, the authors study a set of tasks where a point agent must move to different goal positions in 2D, with the goal chosen randomly for each task within a unit square. The observation is the current 2D position, and actions correspond to velocity commands clipped to the range [-0.1, 0.1]. The reward is the negative squared distance to the goal, and episodes terminate when the agent is within 0.01 of the goal or at the horizon of H = 100. The policy was trained with MAML to maximize performance after one policy gradient update using 20 trajectories. Adaptation to a new task is compared with up to 4 gradient updates, each computed from 40 samples. Results are shown in Fig. 3.
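
The 2D navigation task can be reproduced almost directly from this description; starting the agent at the origin is an assumption, everything else follows the text.

<syntaxhighlight lang="python">
import numpy as np

class PointNavigationTask:
    """Toy version of the 2D navigation task described above: the goal is sampled in the
    unit square, observations are the agent's 2D position, actions are velocity commands
    clipped to [-0.1, 0.1], and the reward is the negative squared distance to the goal.
    Episodes end within 0.01 of the goal or at the horizon H = 100."""

    def __init__(self, rng=None, horizon=100):
        self.rng = rng or np.random.default_rng()
        self.goal = self.rng.uniform(0.0, 1.0, size=2)
        self.horizon = horizon

    def reset(self):
        self.pos = np.zeros(2)  # start position assumed to be the origin
        self.t = 0
        return self.pos.copy()

    def step(self, action):
        self.pos = self.pos + np.clip(action, -0.1, 0.1)
        self.t += 1
        dist2 = float(np.sum((self.pos - self.goal) ** 2))
        done = dist2 < 0.01 ** 2 or self.t >= self.horizon
        return self.pos.copy(), -dist2, done
</syntaxhighlight>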


[[File:ershad_results3.png|500px|center|thumb|Figure 3: Top: quantitative results from 2D navigation task, Bottom: qualitative comparison between model learned with MAML and with fine-tuning from a pre-trained network ]]


Locomotion: To study how well MAML can scale to more complex deep RL problems, adaptation is also studied on high-dimensional locomotion tasks with the MuJoCo simulator [9]. The tasks require two simulated robots, a planar cheetah and a 3D quadruped (the "ant"), to run in a particular direction or at a particular velocity. In the goal-velocity experiments, the reward is the negative absolute difference between the agent's current velocity and a goal velocity, chosen uniformly at random between 0 and 2 for the cheetah and between 0 and 3 for the ant. In the goal-direction experiments, the reward is the magnitude of the velocity in either the forward or backward direction, chosen at random for each task in $p(\mathcal{T})$. The horizon is H = 200, with 20 rollouts per gradient step for all problems except the ant forward/backward task, which uses 40 rollouts per step. The results in Figure 4 show that MAML learns a model that can quickly adapt its velocity and direction with even a single gradient update, and that it continues to improve with more gradient steps. The results also show that, on these challenging tasks, the MAML initialization substantially outperforms random initialization and pretraining.
[[File:ershad_results4.png|1000px|center|thumb|Figure 4: Reinforcement learning results for the half-cheetah and ant locomotion tasks, with the tasks shown on the far right. ]]


A conceptual method to achieve fast adaptation in language modeling tasks (not experimented on by the authors) would be to attach an attention kernel, which results in a simple and differentiable loss. This has been implemented for one-shot language modeling, together with state-of-the-art improvements in one-shot learning on ImageNet and Omniglot [11].


='''Conclusion'''=


The paper introduced a meta-learning method based on learning easily adaptable model parameters through gradient descent. The approach has a number of benefits. It is simple and does not introduce any learned parameters for meta-learning. It can be combined with any model representation that is amenable to gradient-based training, and any differentiable objective, including classification, regression, and reinforcement learning. Lastly, since the method merely produces a weight initialization, adaptation can be performed with any amount of data and any number of gradient steps, though it demonstrates state-of-the-art results on classification with only one or five examples per class. The authors also show that the method can adapt an RL agent using policy gradients and a very modest amount of experience. To conclude, it is evident that MAML is able to determine good model initializations for several tasks with a small number of gradient steps.


[12] appears to be an interesting follow-up to this paper. It tries to answer a fundamental question about meta-learners: is it enough for MAML to learn only the initialization and still perform well on the data on which it is finally retrained, or is representational ability lost by not learning the update rule?
 
='''Critique'''=
In my opinion, Model-Agnostic Meta-Learning resembles a simplified form of curriculum learning. The main idea in curriculum learning is to start with easier subtasks and gradually increase the difficulty level of the tasks while training the machine learning model. It is motivated by the observation that humans and animals seem to learn better when trained with a curriculum-like strategy. However, this paper treats all tasks the same over the whole training history and does not consider the difficulty of the tasks or the adaptation of the neural network to each task. Curriculum learning would be a good way to speed up the training.


The paper does not quantify how different the individual tasks can be while still admitting a useful MAML initialization.


='''References'''=
# Schmidhuber, Jürgen. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 1992.
# Bengio, Samy, et al. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, Univ. of Texas, 1992.
# Andrychowicz, Marcin, et al. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.
# Ravi, Sachin and Larochelle, Hugo. Optimization as a model for few-shot learning. 2016.
# Santoro, Adam, Bartunov, Sergey, Botvinick, Matthew, Wierstra, Daan, and Lillicrap, Timothy. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.
# Koch, Gregory, Zemel, Richard, and Salakhutdinov, Ruslan. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, 2015.
# Lake, Brenden M, Salakhutdinov, Ruslan, Gross, Jason, and Tenenbaum, Joshua B. One shot learning of simple visual concepts. In Conference of the Cognitive Science Society (CogSci), 2011.
# Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), 2016.
# Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), 2012.
# Videos of the learned policies can be found at https://sites.google.com/view/maml.
# Vinyals, Oriol, Blundell, Charles, Lillicrap, Timothy, Kavukcuoglu, Koray, and Wierstra, Daan. Matching networks for one shot learning. arXiv:1606.04080 [cs.LG], 2016.
# https://openreview.net/pdf?id=HyjC5yWCW, under review at ICLR 2018.


Implementation Example: https://github.com/cbfinn/maml

Latest revision as of 10:18, 4 December 2017

Introduction & Background

Learning quickly is a hallmark of human intelligence, whether it involves recognizing objects from a few examples or quickly learning new skills after just minutes of experience. Meta-learning is a subfield of machine learning where automatic learning algorithms are applied on meta-data about machine learning experiments. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In this work, we propose a meta-learning algorithm that is general and model-agnostic, in the sense that it can be directly applied to any learning problem and model that is trained with a gradient descent procedure. Our focus is on deep neural network models, but we illustrate how our approach can easily handle different architectures and different problem settings, including classification, regression, and policy gradient reinforcement learning, with minimal modification. Unlike prior meta-learning methods that learn an update function or learning rule [1,2,3,4], this algorithm does not expand the number of learned parameters nor place constraints on the model architecture (e.g. by requiring a recurrent model [5] or a Siamese network [6]), and it can be readily combined with fully connected, convolutional, or recurrent neural networks. It can also be used with a variety of loss functions, including differentiable supervised losses and nondifferentiable reinforcement learning objectives.

The primary contribution of this work is a simple model and task-agnostic algorithm for meta-learning that trains a model’s parameters such that a small number of gradient updates will lead to fast learning on a new task. The paper shows the effectiveness of the proposed algorithm in different domains, including classification, regression, and reinforcement learning problems.

Key Idea

Arguably, the biggest success story of transfer learning has been initializing vision network weights using pre-trained ImageNet. In particular, when approaching any new vision task, the well-known paradigm is to first collect labeled data for the task, acquire a network pre-trained on ImageNet classification, and then fine-tune the network on the collected data using gradient descent. Using this approach, neural networks can more effectively learn new image-based tasks from modestly-sized datasets. However, pre-training does not go very far, because, the last layers of the network still need to be heavily adapted to the new task, datasets that are too small, as in the few-shot setting, will still cause severe overfitting. Furthermore, we unfortunately don’t have an analogous pre-training scheme for non-vision domains such as speech, language, and control.

What if we directly optimized for an initial representation that can be effectively fine-tuned from a small number of examples? This is exactly the idea behind this paper, model-agnostic meta-learning (MAML). Like other meta-learning methods, MAML trains over a wide range of tasks. It trains for a representation that can be quickly adapted to a new task, via a few gradient steps. The meta-learner seeks to find an initialization that is not only useful for adapting to various problems but also can be adapted quickly (in a small number of steps) and efficiently (using only a few examples).

The key idea underlying this method is to train the model’s initial parameters such that the model has maximal performance on a new task after the parameters have been updated through one or more gradient steps computed with a small amount of data from that new task. This can be viewed from a feature learning standpoint as building an internal representation that is broadly suitable for many tasks. If the internal representation is suitable for many tasks, simply fine-tuning the parameters slightly (e.g. by primarily modifying the top layer weights in a feedforward model) can produce good results.

Model-Agnostic Meta Learning (MAML)

The goal of the proposed model is the rapid adaptation, which means learning a new function from only a few input/output pairs for that task, using prior data from similar tasks for meta-learning. This setting is usually formalized as few-shot learning.

Problem set-up

The goal of few-shot meta-learning is to train a model that can quickly adapt to a new task using only a few data points and training iterations. To do so, the model is trained during a meta-learning phase on a set of tasks, such that it can then be adapted to a new task using only a small number of parameter updates. In effect, the meta-learning problem treats entire tasks as training examples.

Let us consider a model denoted by $f$, that maps the observation $\mathbf{x}$ into the output variable $a$. During meta-learning, the model is trained to be able to adapt to a large or infinite number of tasks.

Let us consider a generic notion of task as below. Each task $\mathcal{T} = \{\mathcal{L}(\mathbf{x}_1,a_1,\mathbf{x}_2,a_2,..., \mathbf{x}_H,a_H), q(\mathbf{x}_1),q(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t),H \}$, consists of a loss function $\mathcal{L}$, a distribution over initial observations $q(\mathbf{x}_1)$, a transition distribution $q(\mathbf{x}_{t+1}|\mathbf{x}_t)$, and an episode length $H$. In i.i.d. supervised learning problems, the length $H =1$. The model may generate samples of length $H$ by choosing an output at at each time $t$. The cost $\mathcal{L}$ provides a task-specific feedback, which is defined based on the nature of the problem.

A distribution over tasks is denoted by $p(\mathcal{T})$. In the K-shot learning setting, the model is trained to learn a new task $\mathcal{T}_i$ drawn from $p(\mathcal{T})$ from only K samples drawn from $q_i$ and feedback $\mathcal{L}_{\mathcal{T}_i}$ generated by $\mathcal{T}_i$. During meta-training, a task $\mathcal{T}_i$ is sampled from $p(\mathcal{T})$, the model is trained with K samples and feedback from the corresponding loss $\mathcal{L}_{\mathcal{T}_i}$, and then tested on new samples from $\mathcal{T}_i$. The model $f$ is then improved by considering how the test error on new data from $q_i$ changes with respect to the parameters. In effect, the test error on sampled tasks $\mathcal{T}_i$ serves as the training error of the meta-learning process. At the end of meta-training, new tasks are sampled from $p(\mathcal{T})$, and meta-performance is measured by the model’s performance after learning from K samples. Notice that tasks used for meta-testing are held out during meta-training.
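As a concrete illustration of this setup (a minimal NumPy sketch, not the authors' code), the snippet below samples K-shot regression tasks from the sinusoid family used in the regression experiment later in this summary; the helper names and the exact sampling ranges are assumptions.

<pre>
import numpy as np

def sample_task(rng):
    """Sample a task T_i ~ p(T): here a sinusoid with random amplitude and phase
    (the family used in the regression experiments below; exact ranges are assumptions)."""
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, np.pi)
    return lambda x: amplitude * np.sin(x - phase)

def sample_k_shot(task_fn, k, rng):
    """Draw K input/output pairs from q_i for this task."""
    x = rng.uniform(-5.0, 5.0, size=(k, 1))
    return x, task_fn(x)

rng = np.random.default_rng(0)
task = sample_task(rng)                                  # one "training example" of the meta-learner
x_train, y_train = sample_k_shot(task, k=10, rng=rng)    # the K samples used to adapt the model
x_test, y_test = sample_k_shot(task, k=25, rng=rng)      # new samples from the same task, whose
                                                         # error drives the meta-update
</pre>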

=='''MAML Algorithm'''==

Figure 1: Diagram of the MAML algorithm

The paper proposes a method that can learn the parameters of any standard model via meta-learning in such a way as to prepare that model for fast adaptation. The intuition behind this approach is that some internal representations are more transferable than others. Since the model will be fine-tuned using a gradient-based learning rule on a new task, the aim is to learn the model in such a way that this gradient-based learning rule can make rapid progress on new tasks drawn from $p(\mathcal{T})$, without overfitting. In effect, we aim to find model parameters that are sensitive to changes in the task, such that small changes in the parameters produce large improvements in the loss function of any task drawn from $p(\mathcal{T})$. Fig. 1 visualizes the MAML algorithm: suppose we are seeking a set of parameters $\theta$ that are highly adaptable. During the course of meta-learning (the bold line), MAML optimizes for a set of parameters such that when a gradient step is taken with respect to a particular task $i$ (the gray lines), the parameters end up close to the optimal parameters $\theta^*_i$ for task $i$.

Note that no assumption is made about the form of the model. The only assumptions are that it is parameterized by a vector of parameters $\theta$ and that the loss is smooth in $\theta$, so that the parameters can be learned using gradient-based techniques. Formally, let the model be denoted by $f_{\theta}$. When adapting to a new task $\mathcal{T}_i$, the model’s parameters $\theta$ become $\theta_i'$. In this method, the updated parameter vector $\theta_i'$ is computed using one or more gradient descent updates on task $\mathcal{T}_i$. For example, when using one gradient update:

$$ \theta_i ' = \theta - \alpha \nabla_{\theta }\mathcal{L}_{\mathcal{T}_i}(f_{\theta}). $$

Here $\alpha$ is the per-task learning rate (or step size) and is treated as a hyperparameter. For simplicity, the rest of the paper considers a single gradient update per task.
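As a minimal numerical sketch of this inner update (not the authors' implementation), consider a linear model $f_{\theta}(\mathbf{x}) = \theta^{\top}\mathbf{x}$ with a mean-squared-error loss, whose gradient is available in closed form:

<pre>
import numpy as np

def mse_grad(theta, X, y):
    """Gradient of (1/K) * sum_j (theta^T x_j - y_j)^2 with respect to theta."""
    residual = X @ theta - y
    return (2.0 / len(y)) * X.T @ residual

alpha = 0.01                                         # per-task step size (a hyperparameter)
theta = np.zeros(3)                                  # current meta-parameters
X = np.random.default_rng(0).normal(size=(10, 3))    # K = 10 support inputs for task T_i
y = X @ np.array([1.0, -2.0, 0.5])                   # targets from a hypothetical true parameter vector

theta_prime = theta - alpha * mse_grad(theta, X, y)  # adapted parameters theta_i'
</pre>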

The model parameters are trained by optimizing for the performance of $f_{\theta_i'}$ with respect to $\theta$ across tasks sampled from $p(\mathcal{T})$. More concretely, the meta-objective is as follows:

$$ \min_{\theta} \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'}) = \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta - \alpha \nabla_{\theta }\mathcal{L}_{\mathcal{T}_i}(f_{\theta})}) $$

Note that the meta-optimization is performed over the model parameters $\theta$, whereas the objective is computed using the updated model parameters $\theta_i'$. In effect, the method optimizes the model parameters such that one or a small number of gradient steps on a new task will produce maximally effective behavior on that task.

Therefore, the meta-optimization across tasks is performed via stochastic gradient descent (SGD), so that the model parameters $\theta$ are updated as:

$$ \theta \gets \theta - \beta \nabla_{\theta } \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'}) $$ where $\beta$ is the meta step size. An outline of the algorithm is shown in Algorithm 1.

The MAML meta-gradient update involves a gradient through a gradient. Computationally, this requires an additional backward pass through $f$ to compute Hessian-vector products, which is supported by standard deep learning libraries such as TensorFlow.
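The sketch below shows one way this gradient-through-a-gradient can be implemented; it uses PyTorch rather than the TensorFlow implementation linked at the end of this summary, the network and task sampler are toy stand-ins, and all hyperparameter values are illustrative. The key point is that create_graph=True keeps the inner update in the computation graph, so backpropagating the query-set loss reaches the original $\theta$.

<pre>
import math
import torch

# Toy regressor written functionally, so that adapted parameters can be
# substituted into the forward pass (a single hidden ReLU layer for brevity).
def forward(params, x):
    w1, b1, w2, b2 = params
    h = torch.relu(x @ w1 + b1)
    return h @ w2 + b2

def mse(params, x, y):
    return ((forward(params, x) - y) ** 2).mean()

# Meta-parameters theta (leaf tensors so the meta-optimizer can update them).
params = [
    (0.1 * torch.randn(1, 40)).requires_grad_(),
    torch.zeros(40, requires_grad=True),
    (0.1 * torch.randn(40, 1)).requires_grad_(),
    torch.zeros(1, requires_grad=True),
]
alpha, beta = 0.01, 0.001
meta_opt = torch.optim.Adam(params, lr=beta)

def sample_sine_task(k=10, q=25):
    """Hypothetical sampler: support and query sets drawn from one random sinusoid."""
    amp = 0.1 + 4.9 * torch.rand(1)
    phase = math.pi * torch.rand(1)
    def draw(n):
        x = 10.0 * torch.rand(n, 1) - 5.0
        return x, amp * torch.sin(x - phase)
    return draw(k), draw(q)

for meta_iteration in range(1000):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(4):                                   # meta-batch of tasks T_i ~ p(T)
        (x_s, y_s), (x_q, y_q) = sample_sine_task()
        # Inner update: create_graph=True keeps this step in the graph, so the
        # meta-gradient can differentiate through it (the gradient through a gradient).
        grads = torch.autograd.grad(mse(params, x_s, y_s), params, create_graph=True)
        adapted = [p - alpha * g for p, g in zip(params, grads)]
        meta_loss = meta_loss + mse(adapted, x_q, y_q)   # L_{T_i}(f_{theta_i'})
    meta_loss.backward()                                 # gradient w.r.t. theta, through the inner step
    meta_opt.step()
</pre>

Dropping create_graph=True detaches the inner gradients and recovers a cheaper first-order approximation, at the cost of ignoring the second-derivative terms.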

='''Different Types of MAML'''=

In this section, the MAML algorithm is discussed for supervised learning and reinforcement learning tasks. The differences between these settings lie in the loss function and in the way the data is generated. In general, the method requires neither additional model parameters nor a separate meta-learner to learn the parameter update. Compared to approaches that “learn to compare new examples in a learned metric space using e.g. Siamese networks or recurrence with attention mechanisms”, the proposed method generalizes to other problems, including classification, regression, and reinforcement learning.

=='''Supervised Regression and Classification'''==

Few-shot learning is well studied in the supervised setting. For these two types of tasks, the horizon is $H = 1$, since the model accepts a single input and produces a single output rather than a sequence of inputs and outputs. The task $\mathcal{T}_i$ generates $K$ i.i.d. observations $\mathbf{x}$ from $q_i$, and the task loss is the error between the model’s output for $\mathbf{x}$ and the corresponding target value $\mathbf{y}$ for that observation and task.

Although any common classification and regression objectives can be used as the loss, the paper uses the following losses for these two tasks.

Regression: For regression, we use the mean-squared error (MSE):

$$ \mathcal{L}_{\mathcal{T}_i} (f_{\theta}) = \sum \limits_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \parallel f_{\theta} (\mathbf{x}^{(j)}) - \mathbf{y}^{(j)}\parallel_2^2 $$

where $\mathbf{x}^{(j)}$ and $\mathbf{y}^{(j)}$ are the input/output pair sampled from task $\mathcal{T}_i$. In K-shot regression tasks, K input/output pairs are provided for learning for each task.

Classification: For classification, we use the cross-entropy loss:

$$ \mathcal{L}_{\mathcal{T}_i} (f_{\theta}) = - \sum \limits_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \Big[ \mathbf{y}^{(j)} \log f_{\theta}(\mathbf{x}^{(j)}) + (1-\mathbf{y}^{(j)}) \log \big(1-f_{\theta}(\mathbf{x}^{(j)})\big) \Big] $$

According to the conventional terminology, K-shot classification tasks use K input/output pairs from each class, for a total of $NK$ data points for N-way classification.

Given a distribution over tasks, these loss functions can be directly inserted into the equations in the previous section to perform meta-learning, as detailed in Algorithm 2.
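For concreteness, the two losses above can be written directly as functions of the model outputs (a minimal NumPy sketch, independent of any particular architecture):

<pre>
import numpy as np

def mse_loss(predictions, targets):
    """Mean-squared error summed over the K sampled input/output pairs of a task."""
    return np.sum((predictions - targets) ** 2)

def cross_entropy_loss(probabilities, one_hot_targets, eps=1e-12):
    """Cross-entropy between predicted class probabilities and one-hot targets,
    summed over the N*K examples of an N-way K-shot classification task."""
    p = np.clip(probabilities, eps, 1.0 - eps)
    return -np.sum(one_hot_targets * np.log(p)
                   + (1.0 - one_hot_targets) * np.log(1.0 - p))
</pre>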

=='''Reinforcement Learning'''==

In reinforcement learning (RL), the goal of few-shot meta learning is to enable an agent to quickly acquire a policy for a new test task using only a small amount of experience in the test setting. A new task might involve achieving a new goal or succeeding on a previously trained goal in a new environment. For example, an agent may learn how to navigate mazes very quickly so that, when faced with a new maze, it can determine how to reliably reach the exit with only a few samples.

Each RL task $\mathcal{T}_i$ contains an initial state distribution $q_i(\mathbf{x}_1)$ and a transition distribution $q_i(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t)$ , and the loss $\mathcal{L}_{\mathcal{T}_i}$ corresponds to the (negative) reward function $R$. The entire task is therefore a Markov decision process (MDP) with horizon H, where the learner is allowed to query a limited number of sample trajectories for few-shot learning. Any aspect of the MDP may change across tasks in $p(\mathcal{T})$. The model being learned, $f_{\theta}$, is a policy that maps from states $\mathbf{x}_t$ to a distribution over actions $a_t$ at each timestep $t \in \{1,...,H\}$. The loss for task $\mathcal{T}_i$ and model $f_{\theta}$ takes the form

$$ \mathcal{L}_{\mathcal{T}_i}(f_{\theta}) = -\mathbb{E}_{\mathbf{x}_t,a_t \sim f_{\theta},q_{\mathcal{T}_i}} \big [\sum_{t=1}^H R_i(\mathbf{x}_t,a_t)\big ] $$


In K-shot reinforcement learning, K rollouts from $f_{\theta}$ and task $\mathcal{T}_i$, $(\mathbf{x}_1,a_1,...,\mathbf{x}_H)$, and the corresponding rewards $ R(\mathbf{x}_t,a_t)$, may be used for adaptation on a new task $\mathcal{T}_i$.

Since the expected reward is generally not differentiable due to unknown dynamics, we use policy gradient methods to estimate the gradient, both for the model's gradient update(s) and for the meta-optimization. Since policy gradients are an on-policy algorithm, each additional gradient step during the adaptation of $f_{\theta}$ requires new samples from the current policy $f_{\theta_i'}$. The procedure is detailed in Algorithm 3, which has the same structure as Algorithm 2 but additionally requires sampling trajectories from the environment corresponding to task $\mathcal{T}_i$ in steps 5 and 8. A variety of improvements to policy gradient algorithms, such as state- or action-dependent baselines, may also be used.
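A minimal sketch of how such a policy gradient estimate might look for the inner adaptation step, using a simple REINFORCE-style estimator on a linear-softmax policy; this is an illustration under those assumptions, not the authors' implementation, and the rollout data below is a hypothetical placeholder.

<pre>
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_grad_estimate(theta, rollouts):
    """Score-function (REINFORCE) estimate of the gradient of
    L(f_theta) = -E[ sum_t R(x_t, a_t) ] for a linear-softmax policy
    pi_theta(a | x) = softmax(theta @ x), with theta of shape (n_actions, state_dim)."""
    grad = np.zeros_like(theta)
    for states, actions, rewards in rollouts:            # the K sampled trajectories
        ret = float(np.sum(rewards))                     # total (undiscounted) return
        for x, a in zip(states, actions):
            pi = softmax(theta @ x)
            score = -np.outer(pi, x)                     # d log pi(a|x) / d theta ...
            score[a] += x                                # ... = (e_a - pi) x^T
            grad += -score * ret                         # minus sign: the loss is the negative return
    return grad / len(rollouts)

# One hypothetical 2-action, 3-dimensional-state rollout of length 5, followed by
# the inner adaptation step theta_i' = theta - alpha * estimated gradient:
rng = np.random.default_rng(0)
theta = np.zeros((2, 3))
rollouts = [([rng.standard_normal(3) for _ in range(5)], [0, 1, 0, 1, 1], [0.1] * 5)]
theta_prime = theta - 0.01 * policy_grad_estimate(theta, rollouts)
</pre>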

='''Experiments'''=

=='''Regression'''==

We start with a simple regression problem that illustrates the basic principles of MAML. Each task involves regressing from the input to the output of a sine wave, where the amplitude and phase of the sinusoid are varied between tasks. Thus, $p(\mathcal{T})$ is continuous, and the input and output both have a dimensionality of 1. During training and testing, datapoints are sampled uniformly. The loss is the mean-squared error between the prediction and the true value. The regressor is a neural network model with 2 hidden layers of size 40 with ReLU nonlinearities. When training with MAML, we use one gradient update with K = 10 examples and a fixed step size of 0.01, and use Adam as the meta-optimizer [7]. The baselines are likewise trained with Adam. To evaluate performance, we fine-tune a single meta-learned model on varying numbers of K examples, and compare performance to two baselines: (a) pre-training on all of the tasks, which entails training a network to regress to random sinusoid functions and then, at test-time, fine-tuning with gradient descent on the K provided points, using an automatically tuned step size, and (b) an oracle which receives the true amplitude and phase as input.

We evaluate performance by fine-tuning the model learned by MAML and the pre-trained model on $K = \{ 5,10,20 \}$ datapoints. During fine-tuning, each gradient step is computed using the same $K$ datapoints. Results are shown in Fig 2.
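The test-time protocol can be sketched as follows, reusing the forward, mse, and sample_sine_task helpers and the meta-learned params from the PyTorch snippet in the MAML Algorithm section above; the number of fine-tuning steps and the query-set size are illustrative assumptions.

<pre>
# One held-out sinusoid task: K = 10 points for adaptation, the rest for evaluation.
(x_k, y_k), (x_test, y_test) = sample_sine_task(k=10, q=100)

adapted = [p.detach().clone().requires_grad_() for p in params]   # start from the meta-learned theta
for step in range(10):                                            # a few fine-tuning steps
    grads = torch.autograd.grad(mse(adapted, x_k, y_k), adapted)  # same K datapoints every step
    adapted = [p - 0.01 * g for p, g in zip(adapted, grads)]

print(mse(adapted, x_test, y_test).item())                        # error on held-out points of the curve
</pre>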


Figure 2: Few-shot adaptation for the simple regression task. Left: Note that MAML is able to estimate parts of the curve where there are no datapoints, indicating that the model has learned about the periodic structure of sine waves. Right: Fine-tuning of a model pre-trained on the same distribution of tasks without MAML, with a tuned step size. Due to the often contradictory outputs on the pre-training tasks, this model is unable to recover a suitable representation and fails to extrapolate from the small number of test-time samples.

=='''Classification'''==

For classification evaluation, Omniglot and MiniImagenet datasets are used. The Omniglot dataset consists of 20 instances of 1623 characters from 50 different alphabets.

The experiment involves fast learning of N-way classification with 1 or 5 shots. The problem of N-way classification is set up as follows: select N unseen classes, provide the model with K different instances of each of the N classes, and evaluate the model’s ability to classify new instances within the N classes. For Omniglot, 1200 characters are selected randomly for training, irrespective of alphabet, and the remaining characters are used for testing. The Omniglot dataset is augmented with rotations by multiples of 90 degrees.

The model follows the same architecture as the embedding function of the matching networks [11], which has 4 modules with a 3-by-3 convolution and 64 filters, followed by batch normalization, a ReLU nonlinearity, and 2-by-2 max-pooling. The Omniglot images are downsampled to 28-by-28, so the dimensionality of the last hidden layer is 64. The last layer is fed into a softmax. For Omniglot, strided convolutions are used instead of max-pooling. For MiniImagenet, 32 filters per layer are used to reduce overfitting. In order to also provide a fair comparison against memory-augmented neural networks [5] and to test the flexibility of MAML, results for a non-convolutional network are also provided.
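A sketch of the Omniglot variant of this architecture in PyTorch (the authors' released code is in TensorFlow): four strided 3-by-3 convolution modules with 64 filters, batch normalization, and ReLU, followed by a linear layer into the softmax loss. The MiniImagenet variant described above would instead use 32 filters, max-pooling, and 3-channel inputs; the padding and initialization details here are assumptions.

<pre>
import torch
from torch import nn

def conv_block(in_channels, out_channels):
    # One "module": 3x3 convolution, batch norm, ReLU.  A stride of 2 stands in
    # for the 2x2 max-pooling, as in the Omniglot variant described above.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(),
    )

def omniglot_classifier(n_way, filters=64):
    return nn.Sequential(
        conv_block(1, filters),
        conv_block(filters, filters),
        conv_block(filters, filters),
        conv_block(filters, filters),
        nn.Flatten(),
        nn.LazyLinear(n_way),   # final linear layer; the softmax is folded into the loss
    )

model = omniglot_classifier(n_way=5)
logits = model(torch.randn(2, 1, 28, 28))             # a batch of two 28x28 Omniglot images
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 3]))
</pre>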

Table 1: Few-shot classification on held-out Omniglot characters (top) and the MiniImagenet test set (bottom). MAML achieves results that are comparable to or outperform state-of-the-art convolutional and recurrent models. Siamese nets, matching nets, and the memory module approaches are all specific to classification and are not directly applicable to regression or RL scenarios. The $\pm$ shows 95% confidence intervals over tasks.

=='''Reinforcement Learning'''==

Several simulated continuous control environments are used for the RL evaluation. In all of the domains, the MAML model is a neural network policy with two hidden layers of size 100 and ReLU activations. The gradient updates are computed using the vanilla policy gradient (REINFORCE), and trust-region policy optimization (TRPO) is used as the meta-optimizer.

In order to avoid computing third derivatives, the Hessian-vector products required by TRPO are approximated using finite differences. For both learning and meta-learning updates, we use the standard linear feature baseline proposed by [8], which is fitted separately at each iteration for each sampled task in the batch.
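The finite-difference trick can be sketched generically as follows (an illustration of the idea, not the TRPO code used by the authors): the Hessian-vector product $Hv$ is approximated by a difference of gradients, so no third derivatives are ever formed.

<pre>
import numpy as np

def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    """Approximate H(theta) @ v via central finite differences of the gradient:
    Hv ~= [grad(theta + eps*v) - grad(theta - eps*v)] / (2*eps)."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

# Example with a quadratic loss L(theta) = 0.5 * theta^T A theta, whose Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_fn = lambda theta: A @ theta
theta = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
print(hessian_vector_product(grad_fn, theta, v))   # close to A @ v = [3.5, 4.5]
</pre>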

The three baseline models used for comparison are: (a) pretraining one policy on all of the tasks and then fine-tuning, (b) training a policy from randomly initialized weights, and (c) an oracle policy which receives the parameters of the task as input, which for the tasks below corresponds to a goal position, goal direction, or goal velocity for the agent.

The baseline models of (a) and (b) are fine-tuned with gradient descent with a manually tuned step size.

2D Navigation: In the first meta-RL experiment, the authors study a set of tasks where a point agent must move to different goal positions in 2D, randomly chosen for each task within a unit square. The observation is the current 2D position, and actions correspond to velocity commands clipped to be in the range [-0.1, 0.1]. The reward is the negative squared distance to the goal, and episodes terminate when the agent is within 0.01 of the goal or at the horizon of H = 100. The policy was trained with MAML to maximize performance after 1 policy gradient update using 20 trajectories. They compare the adaptation to a new task with up to 4 gradient updates, each with 40 samples. Results are shown in Fig. 3.
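A minimal reconstruction of this point-agent environment from the description above (not the authors' environment code; the start state at the origin is an assumption):

<pre>
import numpy as np

class PointNavigation2D:
    """Point agent in 2D; each task is a goal sampled uniformly in the unit square.
    Observation: current position.  Action: velocity clipped to [-0.1, 0.1].
    Reward: negative squared distance to the goal.  Horizon H = 100."""

    def __init__(self, rng, horizon=100):
        self.goal = rng.uniform(0.0, 1.0, size=2)   # the task parameter
        self.horizon = horizon

    def reset(self):
        self.position = np.zeros(2)                 # assumed start state
        self.t = 0
        return self.position.copy()

    def step(self, action):
        self.position = self.position + np.clip(action, -0.1, 0.1)
        self.t += 1
        dist2 = np.sum((self.position - self.goal) ** 2)
        done = bool(dist2 < 0.01 ** 2 or self.t >= self.horizon)
        return self.position.copy(), -dist2, done

rng = np.random.default_rng(0)
env = PointNavigation2D(rng)
obs = env.reset()
obs, reward, done = env.step(np.array([0.1, 0.05]))   # one velocity command
</pre>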

Figure 3: Top: quantitative results from 2D navigation task, Bottom: qualitative comparison between model learned with MAML and with fine-tuning from a pre-trained network

Locomotion. To study how well MAML can scale to more complex deep RL problems, adaptation is also studied on high-dimensional locomotion tasks with the MuJoCo simulator [9]. The tasks require two simulated robots – a planar cheetah and a 3D quadruped (the “ant”) – to run in a particular direction or at a particular velocity. In the goal-velocity experiments, the reward is the negative absolute difference between the agent's current velocity and a goal velocity, chosen uniformly at random between 0 and 2 for the cheetah and between 0 and 3 for the ant. In the goal-direction experiments, the reward is the magnitude of the velocity in either the forward or backward direction, chosen at random for each task in $p(\mathcal{T})$. The horizon is H = 200, with 20 rollouts per gradient step for all problems except the ant forward/backward task, which used 40 rollouts per step. The results in Figure 4 show that MAML learns a model that can quickly adapt its velocity and direction with even just a single gradient update, and continues to improve with more gradient steps. The results also show that, on these challenging tasks, the MAML initialization substantially outperforms random initialization and pretraining.

Figure 4: Reinforcement learning results for the half-cheetah and ant locomotion tasks, with the tasks shown on the far right.

A conceptual route to fast adaptation in language modeling tasks (not experimented on by the authors) would be to attach an attention kernel that yields a simple, differentiable loss; this idea has been explored for one-shot language modeling, alongside state-of-the-art results in one-shot learning on ImageNet and Omniglot [11].

='''Conclusion'''=

The paper introduced a meta-learning method based on learning easily adaptable model parameters through gradient descent. The approach has a number of benefits. It is simple and does not introduce any learned parameters for meta-learning. It can be combined with any model representation that is amenable to gradient-based training, and any differentiable objective, including classification, regression, and reinforcement learning. Lastly, since the method merely produces a weight initialization, adaptation can be performed with any amount of data and any number of gradient steps, though it demonstrates state-of-the-art results on classification with only one or five examples per class. The authors also show that the method can adapt an RL agent using policy gradients and a very modest amount of experience. To conclude, it is evident that MAML is able to determine good model initializations for several tasks with a small number of gradient steps.

[12] appears to be an interesting follow-up to this paper; it tries to answer a fundamental question about meta-learners: is it enough for MAML to learn only the initialization in order to perform well on the data it is ultimately fine-tuned on, or is representational ability lost by not learning the update rule as well?

='''Critique'''=

In my opinion, Model-Agnostic Meta-Learning resembles a simplified form of curriculum learning. The main idea of curriculum learning is to start with easier subtasks and gradually increase the difficulty of the tasks as the model is trained; it is motivated by the observation that humans and animals seem to learn better when trained with a curriculum-like strategy. This paper, however, treats all tasks the same over the whole training history and does not consider the difficulty of the tasks or how well the network has already adapted to them. Incorporating a curriculum could be a good way to speed up training.

The paper does not quantify how different the individual tasks can be while still permitting a useful MAML initialization.

='''References'''=

  1. Schmidhuber, Jürgen. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 1992.
  2. Bengio, Samy, et al. "On the optimization of a synaptic learning rule." Preprints Conf. Optimality in Artificial and Biological Neural Networks. Univ. of Texas, 1992.
  3. Andrychowicz, Marcin, et al. "Learning to learn by gradient descent by gradient descent." Advances in Neural Information Processing Systems. 2016.
  4. Ravi, Sachin, and Hugo Larochelle. "Optimization as a model for few-shot learning." (2016).
  5. Santoro, Adam, Bartunov, Sergey, Botvinick, Matthew, Wierstra, Daan, and Lillicrap, Timothy. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.
  6. Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML Deep Learning Workshop. Vol. 2. 2015.
  7. Lake, Brenden M, Salakhutdinov, Ruslan, Gross, Jason, and Tenenbaum, Joshua B. One shot learning of simple visual concepts. In Conference of the Cognitive Science Society (CogSci), 2011.
  8. Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), 2016.
  9. Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), 2012.
  10. Videos of the learned policies can be found at https://sites.google.com/view/maml.
  11. Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra. "Matching Networks for One Shot Learning". arXiv:1606.04080 [cs.LG]
  12. https://openreview.net/pdf?id=HyjC5yWCW, under review ICLR 2018.



Implementation Example: https://github.com/cbfinn/maml