# Introduction

Typically, the goal of machine learning is to train a model to perform a task. In meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence the term "meta-learning" has exactly the meaning you would expect: the word "meta" introduces a layer of abstraction.

The meta-learning task can be made more concrete with a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from a single 100-class classification problem to a collection of 100 binary classification problems. The goal of meta-learning here is to design and train a single binary classifier that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:

1) A task is sampled. The task is "Is X a dog?"

2) A small set of labeled training data is provided to the model. The labels represent whether or not the image is a picture of a dog.

3) The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.
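The task-sampling part of this procedure can be sketched with a toy stand-in for CIFAR-100. This is a minimal sketch: the one-number "features," the class count, and the `make_binary_task` helper are illustrative assumptions, not part of any real pipeline.

```python
import random

def make_binary_task(dataset, target_class, k, rng):
    """Relabel a multi-class dataset as 'is X / is not X' and draw a
    K-example support set (steps 1 and 2 of the procedure above)."""
    positives = [x for x, c in dataset if c == target_class]
    negatives = [x for x, c in dataset if c != target_class]
    support = ([(x, 1) for x in rng.sample(positives, k // 2)]      # "is a dog"
               + [(x, 0) for x in rng.sample(negatives, k - k // 2)])  # "not a dog"
    rng.shuffle(support)
    return support

rng = random.Random(0)
# Toy stand-in for CIFAR-100: each example is one number, 100 class ids,
# 20 examples per class.
dataset = [(rng.gauss(c, 1.0), c) for c in range(100) for _ in range(20)]
task = rng.randrange(100)                      # step 1: sample a task
support = make_binary_task(dataset, task, k=10, rng=rng)  # step 2: K-shot data
print(len(support), sorted({y for _, y in support}))
```

Step 3, the adaptation itself, is exactly what the meta-learning algorithms below are designed to make easy.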

This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.

In this paper, a probabilistic framework for meta learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning set up using Markov Decision Processes.

# Model Agnostic Meta-Learning

An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al., 2017):

"In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task" (Finn et al, 2017)

In this training algorithm, the parameter vector $\theta$ belonging to the model $f_{\theta}$ is trained such that the meta-objective function $\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' })$ is minimized. The sum in the objective function is over a sampled batch of training tasks. $\mathcal{L}_{\tau_i} (f_{\theta_i'})$ is the training loss function corresponding to the $i^{th}$ task in the batch evaluated at the model $f_{\theta_i'}$. The parameter vector $\theta_i'$ is obtained by updating the general parameter $\theta$ using the loss function $\mathcal{L}_{\tau_i}$ and a set of $K$ training examples specific to the $i^{th}$ task. Note that in alternate versions of this algorithm, additional testing sets are sampled from $\tau_i$ and used to update $\theta$ using testing loss functions instead of training loss functions.
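The inner/outer structure of this update can be sketched on a toy family of 1-D regression tasks. This is a first-order sketch (it drops the second-derivative term of the full MAML meta-gradient, as in the first-order approximation the paper also discusses); the task family, step sizes, and helper names are illustrative assumptions.

```python
import random

def grad_mse(theta, xs, ys):
    """d/dtheta of mean (theta*x - y)^2 for the linear model f(x) = theta*x."""
    return sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def maml_step(theta, tasks, alpha=0.01, beta=0.05):
    """One meta-update: for each sampled task, take one inner gradient step
    to get theta_i', then update theta with the gradient of the task loss
    evaluated at theta_i' (first-order approximation)."""
    meta_grad = 0.0
    for xs, ys in tasks:
        theta_i = theta - alpha * grad_mse(theta, xs, ys)   # inner adaptation
        meta_grad += grad_mse(theta_i, xs, ys)              # outer gradient
    return theta - beta * meta_grad / len(tasks)

rng = random.Random(0)
theta = 0.0
for _ in range(500):
    # Each task: regress y = a*x for a task-specific slope a ~ U(1, 3).
    tasks = []
    for _ in range(4):
        a = rng.uniform(1.0, 3.0)
        xs = [rng.uniform(-1, 1) for _ in range(10)]
        tasks.append((xs, [a * x for x in xs]))
    theta = maml_step(theta, tasks)
```

Here $\theta$ settles near the center of the task distribution, the point from which a single gradient step adapts well to any sampled slope, rather than fitting any one task.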

One important difference between this algorithm and more typical fine-tuning methods is that $\theta$ is explicitly trained to be easily adjusted to perform well on different tasks, rather than trained to perform well on one specific task and then fine-tuned as the environment changes (Sutton et al., 2007).

# Probabilistic Framework for Meta-Learning

This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common in reinforcement learning (RL). Instead of training examples $\{(x, y)\}$, we have trajectories $(x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)$. A trajectory is a sequence of states/observations $x_t$, actions $a_t$, and rewards $R_t$ that is sampled from a task $T$ according to a policy $\pi_{\theta}$. Included with the task is a method for assigning loss values to trajectories, $L_T(\tau)$, which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy $\pi_{\theta}$ with parameter vector $\theta$. This is analogous to training a function $f_{\theta}$ that assigns labels $y$ to feature vectors $x$. More precisely, we have the following definitions:

• $T :=(L_T, P_T(x_0), P_T(x_t | x_{t-1}, a_t), H )$ (A Task)
• $D(T)$ : A distribution over tasks.
• $L_T$: A loss function for the task T that assigns numeric loss values to trajectories.
• $P_T(x_0), P_T(x_t | x_{t-1}, a_t)$: Probability measures specifying the Markovian dynamics of the observations $x_t$.
• $H$: The horizon of the MDP. This is a fixed natural number specifying the length of the task's trajectories.
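These definitions can be made concrete with a minimal sketch. The 1-D "move toward a goal" task, the `make_task` and `rollout` helpers, and the specific dynamics below are illustrative assumptions, not from the paper.

```python
import random

def make_task(goal, horizon, rng):
    """A stand-in for the tuple T = (L_T, P_T(x_0), P_T(x_t|x_{t-1}, a_t), H):
    a 1-D task whose reward is the negative distance to a goal."""
    def p_x0():                       # P_T(x_0): start near the origin
        return rng.gauss(0.0, 0.1)
    def p_next(x, a):                 # P_T(x_t | x_{t-1}, a_t): noisy dynamics
        return x + a + rng.gauss(0.0, 0.01)
    def loss(trajectory):             # L_T: negative cumulative reward
        return -sum(r for (_, _, r) in trajectory)
    return loss, p_x0, p_next, horizon, goal

def rollout(task, policy):
    """Sample one trajectory (x_0, a_1, x_1, R_1, ..., a_H, x_H, R_H) by
    running the policy for H steps; return it with its loss L_T(tau)."""
    loss, p_x0, p_next, H, goal = task
    x = p_x0()
    traj = []
    for _ in range(H):
        a = policy(x)                        # deterministic policy: state -> action
        x = p_next(x, a)
        traj.append((x, a, -abs(x - goal)))  # reward R_t = -|x_t - goal|
    return traj, loss(traj)

rng = random.Random(0)
task = make_task(goal=1.0, horizon=5, rng=rng)
traj, L = rollout(task, policy=lambda x: 0.5 * (1.0 - x))  # move halfway to goal
```

A policy that moves toward the goal accumulates small distances, so its loss (the negated cumulative reward) stays small; training $\pi_{\theta}$ means driving this loss down across sampled tasks.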

The paper goes further and defines a Markov dynamic over sequences of tasks. The policy we would like to meta-learn, $\pi_{\theta}$, after being exposed to a sample of $k$ trajectories $\tau_i^{1:k}$ from the task $T_i$, should produce a new policy $\pi_{\phi}$ that will perform well on the next task $T_{i+1}$. Thus we seek to minimize the following expectation:

$\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)$

where $\mathcal{L}_{T_i, T_{i+1}}(\theta) = \mathrm{E}_{\tau_i^{1:k} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi}) \Big) \bigg)$ and $l$ is the number of tasks.

# Sources

1. Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).
2. Richard S. Sutton, Anna Koop, David Silver. "On the Role of Tracking in Stationary Environments." Proceedings of the 24th International Conference on Machine Learning, pp. 871–878. ACM, 2007.