Depthwise Convolution Is All You Need for Learning Multiple Visual Domains


Presented by

Yuwei Liu, Daniel Mao

Introduction

This paper proposes a multi-domain learning architecture based on depthwise separable convolution. It rests on the assumption that images from different domains share cross-channel correlations but have domain-specific spatial correlations. The proposed model is compact and adds minimal overhead when applied to new domains. The authors also introduce a gating mechanism to promote soft sharing between different domains. The approach was evaluated on the Visual Decathlon Challenge, where it achieved the highest score while requiring only 50% of the parameters of state-of-the-art approaches.

Motivation

Can we build a single neural network that can deal with images across different domains? This question motivates the concept of "multi-domain learning", and there are two challenges in multi-domain learning:

1. Identify a common structure among different domains.

2. Add new tasks to the model while introducing as few additional parameters as possible.

Previous Work

1. Multi-Domain Learning aims at creating a single neural network that performs image classification tasks in a variety of domains. (Bilen and Vedaldi 2017) showed that a single neural network can simultaneously learn several different visual domains by using an instance normalization layer. (Rebuffi, Bilen, and Vedaldi 2017; 2018) proposed universal parametric families of neural networks that contain specialized problem-specific models which differ only by a small number of parameters. (Rosenfeld and Tsotsos 2018) proposed a method called Deep Adaptation Networks (DAN) that constrains newly learned filters for new domains to be linear combinations of existing ones.

2. Multi-Task Learning (Doersch and Zisserman 2017; Kokkinos 2017) extracts different features from a single input to simultaneously perform classification, object recognition, edge detection, etc.

3. Transfer Learning improves the performance of a model on a target domain by leveraging information from a related source domain (Pan, Yang, and others 2010; Bengio 2012; Hu, Lu, and Tan 2015).

Model Architecture

Depthwise Separable Convolution: The proposed approach is based on depthwise separable convolution, which factorizes a standard 3 × 3 convolution into a 3 × 3 depthwise convolution and a 1 × 1 pointwise convolution. While standard convolution performs the channel-wise and spatial-wise computation in one step, depthwise separable convolution splits it into two steps: the depthwise convolution applies a single convolutional filter to each input channel, and the pointwise convolution creates a linear combination of the depthwise outputs. A comparison of standard convolution and depthwise separable convolution is shown in the figure below.

Depthwise convolution and pointwise convolution have different roles in generating new features: the former is used for capturing spatial correlations while the latter is used for capturing channel-wise correlations.
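The two steps can be sketched in plain Python on nested lists. This is a minimal illustration assuming no padding, stride 1, and no bias terms; the function names are hypothetical and are not taken from the paper.

```python
def depthwise_conv(x, dw_filters):
    """Spatial step: one k x k filter per input channel.

    x: nested list [C][H][W]; dw_filters: nested list [C][k][k].
    Assumes no padding and stride 1, so the output is [C][H-k+1][W-k+1].
    """
    k = len(dw_filters[0])
    out_h = len(x[0]) - k + 1
    out_w = len(x[0][0]) - k + 1
    out = []
    for channel, filt in zip(x, dw_filters):
        plane = [[sum(channel[i + a][j + b] * filt[a][b]
                      for a in range(k) for b in range(k))
                  for j in range(out_w)]
                 for i in range(out_h)]
        out.append(plane)
    return out


def pointwise_conv(x, pw_weights):
    """Channel step: a 1 x 1 convolution that mixes channels at each pixel.

    x: nested list [C_in][H][W]; pw_weights: nested list [C_out][C_in].
    """
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wt * x[c][i][j] for c, wt in enumerate(row))
              for j in range(w)]
             for i in range(h)]
            for row in pw_weights]


def separable_conv(x, dw_filters, pw_weights):
    """Depthwise step followed by pointwise step."""
    return pointwise_conv(depthwise_conv(x, dw_filters), pw_weights)


# Two 3 x 3 input channels, all-ones 3 x 3 depthwise filters,
# and a 2 x 2 pointwise matrix that adds and subtracts the channels.
x = [[[1] * 3 for _ in range(3)], [[2] * 3 for _ in range(3)]]
dw = [[[1] * 3 for _ in range(3)] for _ in range(2)]
pw = [[1, 1], [1, -1]]
result = separable_conv(x, dw, pw)
```

Only `depthwise_conv` looks at neighbouring pixels, and only `pointwise_conv` mixes channels, which is the split of roles described above.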

Network Architecture: The experiments use the same ResNet-26 architecture as (Rebuffi, Bilen, and Vedaldi 2018). The original architecture has three macro residual blocks, which output 64, 128, and 256 feature channels respectively. Each macro block consists of 4 residual blocks, and each residual block has two convolutional layers with 3 × 3 filters. The network ends with a global average pooling layer and a softmax layer for classification. Each standard convolution in ResNet-26 is replaced with a depthwise separable convolution, and the channel size is increased. The modified network architecture is shown in Fig. 2.
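The layer arithmetic behind this description can be checked with a short tally. The sketch below assumes one initial convolution before the macro blocks and counts the final classifier as one layer; neither detail is stated above. It also compares weights at the original widths only, because the amount by which the modified network widens its channels is not given here.

```python
STAGE_CHANNELS = [64, 128, 256]   # output channels of the three macro blocks
BLOCKS_PER_STAGE = 4              # residual blocks per macro block
CONVS_PER_BLOCK = 2               # 3 x 3 convolutions per residual block


def count_weight_layers(stem_convs=1, classifier_layers=1):
    """Total weight layers; stem_convs and classifier_layers are assumptions."""
    residual_convs = len(STAGE_CHANNELS) * BLOCKS_PER_STAGE * CONVS_PER_BLOCK
    return stem_convs + residual_convs + classifier_layers


def conv_weights(channels, k=3):
    """Weights of one conv that keeps `channels` channels (biases ignored).

    Returns (standard, depthwise, pointwise) weight counts.
    """
    standard = k * k * channels * channels
    depthwise = k * k * channels
    pointwise = channels * channels
    return standard, depthwise, pointwise


total_layers = count_weight_layers()
for c in STAGE_CHANNELS:
    standard, depthwise, pointwise = conv_weights(c)
    print(f"{c:>3} channels: standard={standard}, "
          f"depthwise={depthwise}, pointwise={pointwise}")
```

At equal width, the depthwise part is a small fraction of the separable layer's weights, which is what leaves room to increase the channel size.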

[Figure: Standard convolution and depthwise separable convolution]

The reduction in parameters does not harm the performance of the model, and the use of depthwise separable convolution allows cross-channel correlations and spatial correlations to be modeled separately. The idea behind the multi-domain learning method is to leverage the different roles of cross-channel and spatial correlations in generating image features by sharing the pointwise convolution across domains.

Learning Multiple Domains: For multi-domain learning, it is essential to have a set of universally sharable parameters that can generalize to unseen domains. To get a good starting set of parameters, the modified ResNet-26 is first trained on ImageNet. Each time a new domain arrives, a new output layer is added to this well-initialized network and only the depthwise convolutional filters are fine-tuned. The pointwise convolutional filters are shared across domains. Since image statistics differ between domains, domain-specific batch normalization parameters are also allowed. During inference, the trained depthwise convolutional filters for all domains are stacked as a 4D tensor. The adoption of depthwise separable convolution provides a natural separation for modeling cross-channel correlations and spatial correlations. Experimental evidence (Chollet 2017) suggests that decoupling cross-channel correlations and spatial correlations results in more useful features. The authors take this one step further and develop a multi-domain learning method based on the assumption that different domains share cross-channel correlations but have domain-specific spatial correlations. The method is based on two observations: model efficiency and interpretability of hidden units in a deep neural network.
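The sharing scheme for a single layer can be sketched as follows: one pointwise weight matrix is held in common, and each domain registers its own depthwise filters. The class and method names are hypothetical, and the domain-specific batch normalization and output layers are left out to keep the example short.

```python
def depthwise(x, filters):
    """Valid, stride-1 depthwise conv. x: [C][H][W], filters: [C][k][k]."""
    k = len(filters[0])
    rows, cols = len(x[0]) - k + 1, len(x[0][0]) - k + 1
    return [[[sum(ch[i + a][j + b] * f[a][b]
                  for a in range(k) for b in range(k))
              for j in range(cols)] for i in range(rows)]
            for ch, f in zip(x, filters)]


def pointwise(x, weights):
    """1 x 1 conv. x: [C_in][H][W], weights: [C_out][C_in]."""
    rows, cols = len(x[0]), len(x[0][0])
    return [[[sum(w * x[c][i][j] for c, w in enumerate(row))
              for j in range(cols)] for i in range(rows)]
            for row in weights]


class MultiDomainSeparableLayer:
    """One separable layer with shared pointwise and per-domain depthwise."""

    def __init__(self, shared_pointwise):
        self.shared_pointwise = shared_pointwise   # [C_out][C_in], shared
        self.domain_depthwise = {}                 # domain -> [C_in][k][k]

    def add_domain(self, name, depthwise_filters):
        """Register the filters fine-tuned for a newly arrived domain."""
        self.domain_depthwise[name] = depthwise_filters

    def stacked_filters(self):
        """All domains' filters as a 4D nested list [domain][C][k][k]."""
        return list(self.domain_depthwise.values())

    def forward(self, x, domain):
        """Domain-specific spatial step, then the shared channel step."""
        return pointwise(depthwise(x, self.domain_depthwise[domain]),
                         self.shared_pointwise)

    def shared_param_count(self):
        return sum(len(row) for row in self.shared_pointwise)

    def domain_param_count(self, name):
        return sum(len(row) for f in self.domain_depthwise[name] for row in f)


layer = MultiDomainSeparableLayer([[1, 0], [0, 1]])
layer.add_domain("a", [[[1] * 3 for _ in range(3)] for _ in range(2)])
layer.add_domain("b", [[[0, 0, 0], [0, 1, 0], [0, 0, 0]] for _ in range(2)])
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1] * 3 for _ in range(3)]]
out_a = layer.forward(x, "a")
out_b = layer.forward(x, "b")
```

Adding a domain only adds an entry to `domain_depthwise`; `shared_pointwise` is never touched, which is where the parameter savings come from.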

Soft Sharing of Trained Depthwise Filters: In addition to sharing pointwise filters, the paper also investigates whether the depthwise filters (spatial correlations) learned from other domains can be transferred to the target domain. To this end, a novel soft sharing approach is introduced in the multi-domain setting to allow the sharing of depthwise convolution. After the domain-specific depthwise filters have been trained, all of them are stacked as in Fig. 4.

During soft sharing, each domain is trained one by one. All the domain-specific depthwise filters and the pointwise filters (trained on ImageNet) are fixed; only the feedforward network that controls the softmax gate is trained. For a specific target domain, the softmax gate allows soft sharing of the trained depthwise filters with other domains. It is widely believed that early layers in a convolutional neural network detect lower-level features such as textures, while later layers detect parts or objects. Based on this observation, the network is partitioned into three regions (early, middle, late), as shown in Figure 2, and different placements of the softmax gate are considered, which allows a variety of sharing strategies to be compared.
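One way to picture the gate is as a set of softmax weights over the stacked, frozen depthwise filters. The sketch below assumes the gate yields one weight per domain and takes the gate logits as given numbers standing in for the output of the feedforward network. Mixing filters rather than per-domain outputs is also an assumption; the two are equivalent here because convolution is linear in its filter. All names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of numbers."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_depthwise_filters(stacked, gate_logits):
    """Softly share frozen filters across domains.

    stacked: nested list [D][C][k][k] of trained depthwise filters.
    gate_logits: D numbers, assumed to come from the trainable gate network.
    Returns one [C][k][k] filter bank, the softmax-weighted sum over domains.
    """
    weights = softmax(gate_logits)
    n_channels = len(stacked[0])
    k = len(stacked[0][0])
    return [[[sum(w * stacked[d][c][a][b] for d, w in enumerate(weights))
              for b in range(k)]
             for a in range(k)]
            for c in range(n_channels)]


# Two domains, one channel each: filters of all 1.0 and all 3.0.
stacked = [
    [[[1.0] * 3 for _ in range(3)]],
    [[[3.0] * 3 for _ in range(3)]],
]
even_mix = mix_depthwise_filters(stacked, [0.0, 0.0])
first_only = mix_depthwise_filters(stacked, [10.0, -10.0])
```

Equal logits average the two filter banks, while a strongly peaked gate recovers a single domain's filters, so hard domain-specific filters are a limiting case of the soft gate.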

Experiments

The authors conducted two experiments to test the theories: the first was based on simulation, and the second used the CIFAR10 dataset.

There are two activations used in the simulation: well-specified under-parametrized logistic regression as well as general convex ERM with the under-confident activation [math]\displaystyle{ \sigma_{underconf} }[/math]. The “calibration curves” were plotted for both activations: the x-axis is p, the y-axis is the average probability given the prediction.
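A calibration curve of this kind can be approximated from samples by binning the predicted probabilities. The sketch below assumes binary 0/1 labels stand in for the true probability and uses equal-width bins; the function name and bin count are illustrative choices, not the authors' procedure.

```python
def calibration_curve(probs, labels, n_bins=10):
    """Binned calibration curve.

    probs: predicted probabilities in [0, 1]; labels: matching 0/1 outcomes.
    Returns one (mean predicted probability, observed positive rate, count)
    tuple per non-empty bin, ordered from low to high probability.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    curve = []
    for items in bins:
        if items:
            mean_p = sum(p for p, _ in items) / len(items)
            positive_rate = sum(y for _, y in items) / len(items)
            curve.append((mean_p, positive_rate, len(items)))
    return curve


# Toy data: predictions of 0.95 that are right only 3 times out of 4.
probs = [0.95, 0.95, 0.95, 0.95, 0.15, 0.15]
labels = [1, 1, 1, 0, 0, 0]
curve = calibration_curve(probs, labels)
```

A bin whose mean predicted probability exceeds its observed positive rate indicates over-confidence in that range; the reverse indicates under-confidence.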

The figure above shows four main results. First, the logistic regression is over-confident at all [math]\displaystyle{ \kappa }[/math]. Second, over-confidence becomes more severe as [math]\displaystyle{ \kappa }[/math] increases, which suggests that the conclusion of the theory holds more broadly than its assumptions. Third, [math]\displaystyle{ \sigma_{underconf} }[/math] leads to under-confidence for [math]\displaystyle{ p \in (0.5, 0.51) }[/math], which verifies Theorem 2 and Corollary 3. Finally, the theoretical prediction closely matches the simulation, which further confirms the theory.

The generality of the theory beyond the Gaussian input assumption and the binary classification setting was further tested on the CIFAR10 dataset by running multi-class logistic regression on its first five classes. The authors performed logistic regression on two kinds of labels: the true labels and pseudo-labels generated from the multi-class logistic (softmax) model.

The figure above indicates that the logistic regression is over-confident on both labels, where the over-confidence is more severe on the pseudo-labels than the true labels. This suggests the result that logistic regression is inherently over-confident may hold more broadly for other under-parametrized problems without strong assumptions on the input distribution, or even when the labels are not necessarily realizable by the model.

Conclusion

A multi-domain learning approach is proposed based on depthwise separable convolution and the assumption that images from different domains share the same channel-wise correlations but have domain-specific spatial correlations. Evaluated on the Visual Decathlon Challenge, the model achieves the highest score. The visualization results reveal that depthwise convolution captures more attributes and higher-level concepts than pointwise convolution.

References