Unsupervised Domain Adaptation with Residual Transfer Networks

From statwiki
Jump to navigation Jump to search

Introduction

Domain Adaptation [1]is a problem in machine learning which involves taking a model which has been trained on a source domain, and applying this to a different (but related) target domain. Unsupervised domain adaptation refers to the situation in which the source data is labelled, while the target data is (predominantly) unlabeled. The problem at hand is then finding ways to generalize the learning on the source domain to the target domain. In the age of deep networks this problem has become particularly salient due to the need for vast amounts of labeled training data, in order to reap the benefits of deep learning. Manual generation of labeled data is often prohibitive, and in absence of such data networks are rarely performant. The attempt to circumvent this drought of data typically necessitates the gathering of "off-the-shelf" data sets, which are tangentially related and contain labels, and then building models in these domains. The fundamental issue that unsupervised domain adaptation attempts to address is overcoming the inherent shift in distribution across the domains, without the ability to observe this shift directly.

This paper proposes a method for unsupervised domain adaptation which relies on three key components:

  1. A kernel-based penalty to ensure that the abstract representations generated by the networks hidden layers are similar between the source and the target data;
  2. An entropy based penalty on the target classifier, which exploits the entropy minimization principle; and
  3. A residual network structure is appended, which allows the source and target classifiers to differ by a (learned) residual function, thus relaxing the shared classifier assumption which is traditionally made.

This method outperforms state-of-the-art techniques on common benchmark datasets, and is flexible enough to be applied in most feed-forward neural networks.

The Office-31 Dataset Images for Backpack. Shows the variation in the source and target domains to motivate why these methods are important.

Working Example (Office-31)

In order to assist in the understanding of the methods, it is helpful to have a tangible sense of the problem front of mind. The Domain Adaptation Project [2] provides data sets which are tailored to the problem of unsupervised domain adaptation. One of these data sets (which is later used in the experiments of this paper) has images which are labeled based on the Amazon product page for the various items. There are then corresponding pictures taken either by webcams or digital SLR cameras. The goal of unsupervised domain adaptation on this data set would be to take any of the three image sources as the source domain, and transfer a classifier to the other domain; see the example images to understand the differences.

One can imagine that, while it is likely easy to scrape labeled images from Amazon, it is likely far more difficult to collect labeled images from webcam or DSLR pictures directly. The ultimate goal of this method would be to train a model to recognize a picture of a backpack taken with a webcam, based on images of backpacks scraped from Amazon (or similar tasks).

Related Work

Residual Transfer Networks

Generally, in an unsupervised domain adaptation problem, we are dealing with a set $\mathcal{D}_s$ (called the source domain) which is defined by $\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$. That is the set of all labeled input-output pairs in our source data set. We denote the number of source elements by $n_s$. There is a corresponding set $\mathcal{D}_t = \{(x_i^t)\}_{i=1}^{n_t}$ (the target domain), consisting of unlabeled input values. There are $n_t$ such values.

We can think of $\mathcal{D}_s$ as being sampled from some underlying distribution $p$, and $\mathcal{D}_t$ as being sampled from $q$. Generally we have that $p \neq q$, partially motivating the need for domain adaptation methods.

We can consider the classifiers $f_s(\underline{x})$ and $f_t(\underline{x})$, for the source domain and target domain respectively. It is possible to learn $f_s$ based on the sample $\mathcal{D}_s$. Under the shared classifier assumption it would be the case that $f_s(\underline{x}) = f_t(\underline{x})$, and thus learning the source classifier is enough. This method relaxes this assumption, assuming that in general $f_s \neq f_t$, and attempting to learn both.

The example network extends deep convolutional networks (in this case AlexNet [3]) to Residual Transfer Networks, the mechanics of which are outlined below. Recall that, if $L(\cdot, \cdot)$ is taken to be the cross-entropy loss function, then the empirical error of a CNN on the source domain $\mathcal{D}_s$ is given by:

[math]\displaystyle{ \min_{f_s} \frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s) }[/math]

In a standard implementation, the CNN optimizes over the above loss. This will be the starting point for the RTN.

Feature Adaptation

Classifier Adaptation

Residual Transfer Network

Experiments

Set-up

Results

Discussion

Conclusion

Critique