When Does Self-Supervision Improve Few-Shot Learning?

From statwiki
Revision as of 01:19, 8 November 2020 by A4moayye

Presented by

Arash Moayyedi

Introduction

This paper seeks to improve generalization in few-shot learning by applying self-supervised learning techniques to the base dataset. Few-shot learning refers to training a classifier on very small datasets, in contrast to the usual practice of using massive amounts of data, in the hope of successfully classifying previously unseen but related classes. Self-supervised learning, in turn, aims to teach the model the internal structure of images by giving it auxiliary tasks such as predicting the angle by which an image has been rotated. This helps with the generalization issue mentioned above, in which the model struggles to tell newly introduced classes apart.

Previous Work

This work builds on few-shot learning, where the aim is to learn general representations so that, when facing novel classes, the model can differentiate between them after training on just a few samples. Many few-shot learning methods exist; this paper focuses on Prototypical Networks, or ProtoNets for short. One section of the paper also compares this model with Model-Agnostic Meta-Learning (MAML).
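To make the ProtoNet idea concrete, here is a minimal sketch of its classification rule, assuming the embeddings have already been computed by some backbone. Each class prototype is the mean of that class's support embeddings, and a query is assigned to the class of the nearest prototype via a softmax over negative squared Euclidean distances. The function names (`build_prototypes`, `classify`) and the toy two-dimensional embeddings are illustrative assumptions, not taken from the paper.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_prototypes(support_embeddings, support_labels):
    """One prototype per class: the mean of that class's support embeddings."""
    by_class = {}
    for embedding, label in zip(support_embeddings, support_labels):
        by_class.setdefault(label, []).append(embedding)
    return {label: mean_vector(embs) for label, embs in by_class.items()}


def classify(query_embedding, prototypes):
    """Softmax over negative squared distances to each prototype.

    Returns the predicted label and the per-class probabilities.
    """
    labels = sorted(prototypes)
    logits = [-squared_distance(query_embedding, prototypes[l]) for l in labels]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    probs = {l: e / total for l, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs


# Toy 2-way 2-shot episode with hand-written (hypothetical) embeddings.
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
labels = ["cat", "cat", "dog", "dog"]
prototypes = build_prototypes(support, labels)
prediction, probs = classify([1.0, 1.0], prototypes)
print(prototypes, prediction)
```

In a real system the embeddings would come from the trained network rather than being written by hand; only the prototype and distance logic is shown here.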

The other machine learning technique this paper builds on is self-supervised learning (SSL). SSL makes use of unlabeled data, which matters because labeling and maintaining massive datasets is expensive. The image itself already contains structural information that can be exploited. Many SSL tasks exist, such as removing part of the data and asking the model to reconstruct the missing part. Other tasks include predicting rotations, predicting relative patch locations, etc.
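As an illustration of how a self-supervised task produces labels for free, the following sketch builds a rotation-prediction task from a single unlabeled image. It assumes the image is a small 2D grid stored as a list of rows and that rotations are restricted to multiples of 90 degrees; the helper names `rotate90` and `rotation_task` are hypothetical.

```python
def rotate90(image):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_task(image):
    """Return four (rotated_image, label) pairs.

    Label k in {0, 1, 2, 3} means the image was rotated by k * 90 degrees
    clockwise, so the target is derived from the image alone.
    """
    samples = []
    current = [list(row) for row in image]  # copy so the input is untouched
    for k in range(4):
        samples.append((current, k))
        current = rotate90(current)
    return samples


image = [[1, 2],
         [3, 4]]
samples = rotation_task(image)
for rotated, label in samples:
    print(label, rotated)
```

A predictor trained on such pairs has to recognize the orientation of the content in order to recover the label, which is the kind of structural information referred to above.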

Method

The authors propose a framework, shown in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of image-label pairs drawn from a set of base classes, and its domain is denoted by [math]\displaystyle{ \mathcal{D}_s }[/math]. Similarly, the domain of the images used for the self-supervised tasks is denoted by [math]\displaystyle{ \mathcal{D}_{ss} }[/math]. The paper also analyzes how having [math]\displaystyle{ \mathcal{D}_s = \mathcal{D}_{ss} }[/math] versus [math]\displaystyle{ \mathcal{D}_s \neq \mathcal{D}_{ss} }[/math] affects the accuracy of the final few-shot learning task.

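The input is fed to a feed-forward convolutional network [math]\displaystyle{ f(x) }[/math], which serves as the backbone shared by the classifier [math]\displaystyle{ g }[/math] and the self-supervised target predictor [math]\displaystyle{ h }[/math]. The classification loss [math]\displaystyle{ \mathcal{L}_s }[/math] and the self-supervised task prediction loss [math]\displaystyle{ \mathcal{L}_{ss} }[/math] are written as follows, where [math]\displaystyle{ \ell }[/math] is a per-example loss, [math]\displaystyle{ \mathcal{R} }[/math] is a regularization term, and [math]\displaystyle{ (\hat{x_i}, \hat{y_i}) }[/math] is the transformed input and the automatically generated target for the self-supervised task: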


[math]\displaystyle{ \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), }[/math]

[math]\displaystyle{ \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). }[/math]


The final loss is [math]\displaystyle{ \mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss} }[/math], so the self-supervised loss acts as a data-dependent regularizer for representation learning. Gradient updates are performed on this combined loss. Note that when [math]\displaystyle{ \mathcal{D}_s \neq \mathcal{D}_{ss} }[/math], a separate forward pass is done on a batch from each dataset, and the two resulting losses are then combined.
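The combined objective can be sketched in a few lines. This is a toy version under stated assumptions, not the authors' implementation: the backbone [math]\displaystyle{ f }[/math] is replaced by a single linear layer with ReLU, the heads [math]\displaystyle{ g }[/math] and [math]\displaystyle{ h }[/math] are plain linear maps, [math]\displaystyle{ \ell }[/math] is taken to be cross-entropy, and the regularizer [math]\displaystyle{ \mathcal{R} }[/math] is omitted. All names (`backbone`, `combined_loss`, `W_f`, `W_g`, `W_h`) are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class index."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]


def backbone(x, W_f):
    """Shared feature extractor f: linear layer + ReLU (stand-in for a CNN)."""
    return [max(0.0, v) for v in matvec(W_f, x)]


def combined_loss(sup_batch, ss_batch, W_f, W_g, W_h):
    """Return (L, L_s, L_ss) with L = L_s + L_ss.

    sup_batch: (x, y) pairs from D_s, scored by the classifier head g.
    ss_batch:  (x_hat, y_hat) pairs from D_ss, scored by the head h.
    Both heads read features from the same backbone weights W_f.
    """
    loss_s = sum(cross_entropy(matvec(W_g, backbone(x, W_f)), y)
                 for x, y in sup_batch)
    loss_ss = sum(cross_entropy(matvec(W_h, backbone(x_hat, W_f)), y_hat)
                  for x_hat, y_hat in ss_batch)
    return loss_s + loss_ss, loss_s, loss_ss


# Toy weights: identity backbone, identity classifier, untrained 4-way SSL head.
W_f = [[1.0, 0.0], [0.0, 1.0]]
W_g = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[0.0, 0.0] for _ in range(4)]

sup_batch = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]   # labeled images from D_s
ss_batch = [([1.0, 0.0], 0), ([0.0, 1.0], 3)]    # e.g. rotation targets from D_ss
total, loss_s, loss_ss = combined_loss(sup_batch, ss_batch, W_f, W_g, W_h)
print(total, loss_s, loss_ss)
```

Because `sup_batch` and `ss_batch` are separate arguments, the same function covers both settings: pass transformed copies of the labeled images when the two domains coincide, or a batch drawn from a different image collection when they differ. In either case the gradient of the total flows into the shared `W_f` from both terms.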